A significant portion of a retail bank's profit comes from home loans, which are typically taken out by customers with a regular income or high earnings. Defaulters are a major concern for banks, as non-performing loans (NPLs) can significantly erode profits. It is therefore important that banks are judicious when approving loans for their customer base.
The loan approval process is complex and involves multiple facets. In this process, banks undertake a detailed manual examination of various elements of the loan application to assess the applicant's creditworthiness. This procedure is not only laborious but also susceptible to incorrect judgments or approvals, largely due to the potential for human errors and biases.
Numerous banks have previously tried to automate the loan approval process through heuristic methods. However, with the emergence of data science and machine learning technologies, there's a growing trend towards developing systems capable of learning and optimizing this process. These advanced systems aim to eliminate biases and enhance efficiency. A critical consideration in this shift is ensuring that these automated systems do not inadvertently adopt any of the biases that may have been present in the traditional, human-driven approval processes.
Develop a predictive classification model designed to identify clients at risk of loan default and provide the bank with insights on key factors to be considered during the loan approval process.
The objective is to leverage data science to create a predictive model capable of accurately identifying potential loan defaulters. This model will assist banks in making well-informed and risk-aware lending decisions while adhering to the guidelines of the Equal Credit Opportunity Act. Compliance with this act necessitates interpretability and empirical grounding to ensure transparency and objectivity in the lending process.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant ultimately defaulted or was severely delinquent; this adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
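To make the DEBTINC definition concrete, a minimal sketch with made-up figures (illustrative only, not drawn from the dataset):

```python
# Hypothetical example: debt-to-income ratio as lenders compute it.
# The figures below are illustrative, not from the HMEQ data.
monthly_debt_payments = 1500.0  # e.g., mortgage + car loan + credit card minimums
gross_monthly_income = 4500.0

debtinc = monthly_debt_payments / gross_monthly_income * 100
print(round(debtinc, 2))  # 33.33, close to the dataset's mean DEBTINC of ~33.8
```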
!pip install scikeras
!pip install shap
# Libraries for data manipulation and analysis
import pandas as pd
import numpy as np
# Library for splitting data
from sklearn.model_selection import train_test_split
# Libraries for data visualization and output analysis
import matplotlib.pyplot as plt
import seaborn as sns
import shap as sh
# Libraries for machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
# Libraries for building and training neural network models
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Activation
from tensorflow.keras import backend as K
from scikeras.wrappers import KerasClassifier
from sklearn.utils import class_weight
from tensorflow.keras.optimizers import Adamax
from tensorflow.keras.optimizers import Adam
# Library for statistical calculations
import scipy.stats as stats
# Library to help with feature engineering
from sklearn.preprocessing import PolynomialFeatures
# Library to help with multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
# To calculate Pearson's correlation coefficient
from scipy.stats import pearsonr
# Library for generating pseudo-random numbers
import random
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Libraries for evaluating model performance
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
classification_report,
precision_recall_curve,
make_scorer,
)
# Libraries for imbalanced dataset handling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# Set random seeds for reproducibility
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
# Library for model serialization
import joblib
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')
# Reading the data
original_data = pd.read_csv('/content/drive/MyDrive/Week 10/hmeq.csv')
# Copying the data to a different variable to prevent modifications to the original data
data = original_data.copy()
# Displaying the first five rows
data.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# Displaying the last five rows
data.tail()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
# Checking the shape of the data
data.shape
(5960, 13)
Observation:
The dataset has 5,960 rows and 13 columns.
# Checking the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Observations:
# Generating summary statistics for all columns
data.describe(include='all')
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5960.000000 | 5960.000000 | 5442.000000 | 5848.000000 | 5708 | 5681 | 5445.000000 | 5252.000000 | 5380.000000 | 5652.000000 | 5450.000000 | 5738.000000 | 4693.000000 |
| unique | NaN | NaN | NaN | NaN | 2 | 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | NaN | NaN | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | 3928 | 2388 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.199497 | 18607.969799 | 73760.817200 | 101776.048741 | NaN | NaN | 8.922268 | 0.254570 | 0.449442 | 179.766275 | 1.186055 | 21.296096 | 33.779915 |
| std | 0.399656 | 11207.480417 | 44457.609458 | 57385.775334 | NaN | NaN | 7.573982 | 0.846047 | 1.127266 | 85.810092 | 1.728675 | 10.138933 | 8.601746 |
| min | 0.000000 | 1100.000000 | 2063.000000 | 8000.000000 | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.524499 |
| 25% | 0.000000 | 11100.000000 | 46276.000000 | 66075.500000 | NaN | NaN | 3.000000 | 0.000000 | 0.000000 | 115.116702 | 0.000000 | 15.000000 | 29.140031 |
| 50% | 0.000000 | 16300.000000 | 65019.000000 | 89235.500000 | NaN | NaN | 7.000000 | 0.000000 | 0.000000 | 173.466667 | 1.000000 | 20.000000 | 34.818262 |
| 75% | 0.000000 | 23300.000000 | 91488.000000 | 119824.250000 | NaN | NaN | 13.000000 | 0.000000 | 0.000000 | 231.562278 | 2.000000 | 26.000000 | 39.003141 |
| max | 1.000000 | 89900.000000 | 399550.000000 | 855909.000000 | NaN | NaN | 41.000000 | 10.000000 | 15.000000 | 1168.233561 | 17.000000 | 71.000000 | 203.312149 |
Observations:
# Checking duplicate entries
data.duplicated().sum()
0
Observation:
There are no duplicate entries in the data.
# Checking missing values
data.isnull().sum()
BAD            0
LOAN           0
MORTDUE      518
VALUE        112
REASON       252
JOB          279
YOJ          515
DEROG        708
DELINQ       580
CLAGE        308
NINQ         510
CLNO         222
DEBTINC     1267
dtype: int64
Observation:
Eleven columns have missing values, ranging from 112 missing values in VALUE to 1,267 in DEBTINC. Only BAD and LOAN have no missing values.
# Checking missing values as percentage for each column
data.isnull().sum() / data.shape[0] * 100
BAD         0.000000
LOAN        0.000000
MORTDUE     8.691275
VALUE       1.879195
REASON      4.228188
JOB         4.681208
YOJ         8.640940
DEROG      11.879195
DELINQ      9.731544
CLAGE       5.167785
NINQ        8.557047
CLNO        3.724832
DEBTINC    21.258389
dtype: float64
Observations:
# Checking the data subset containing missing values in DEBTINC
nan_debtinc = data[pd.isna(data['DEBTINC'])]
nan_debtinc
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5930 | 1 | 72300 | NaN | 85000.0 | DebtCon | Other | 1.0 | 0.0 | 0.0 | 117.166667 | 9.0 | 23.0 | NaN |
| 5932 | 1 | 76500 | 38206.0 | 90000.0 | DebtCon | Other | 12.0 | 0.0 | 0.0 | 134.900000 | 0.0 | 26.0 | NaN |
| 5933 | 1 | 77200 | 83962.0 | 215000.0 | HomeImp | Self | 8.0 | 1.0 | 2.0 | 71.533132 | 3.0 | 14.0 | NaN |
| 5935 | 0 | 78400 | 13900.0 | 102910.0 | HomeImp | NaN | 27.0 | 0.0 | 1.0 | 138.000000 | 0.0 | 14.0 | NaN |
| 5948 | 0 | 86000 | 47355.0 | 85000.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 210.966667 | 0.0 | 16.0 | NaN |
1267 rows × 13 columns
Observation:
A majority of the entries with missing DEBTINC values also defaulted.
# Checking total missing values for DEBTINC
total_nan_debtinc = nan_debtinc.shape[0]
total_nan_debtinc
1267
# Checking percentage of missing values in DEBTINC that default
bad_nan_debtinc = nan_debtinc[nan_debtinc['BAD'] == 1].shape[0] / total_nan_debtinc * 100
bad_nan_debtinc
62.036306235201266
Observations:
Possible approaches to handling missing values for DEBTINC:
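One way to decide among approaches is to test whether DEBTINC missingness is itself associated with default, which would argue for keeping a missingness flag rather than silently imputing. A sketch using scipy (the helper name `missingness_vs_target` is ours, not from the notebook):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def missingness_vs_target(df, col, target='BAD'):
    '''Chi-square test of independence between missingness in col and the target.'''
    contingency = pd.crosstab(df[col].isna(), df[target])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    return chi2, p_value

# In this notebook: missingness_vs_target(data, 'DEBTINC')
# A very small p-value suggests the missingness itself is informative.
```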
# Checking the data subset containing missing values in DEROG
nan_derog = data[pd.isna(data['DEROG'])]
nan_derog
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 1 | 2000 | 22608.0 | NaN | NaN | NaN | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | 1 | 2200 | 23030.0 | NaN | NaN | NaN | 19.0 | NaN | NaN | NaN | NaN | NaN | 3.711312 |
| 23 | 1 | 2400 | 18000.0 | NaN | HomeImp | Mgr | 22.0 | NaN | 2.0 | 121.733333 | 0.0 | 10.0 | NaN |
| 48 | 0 | 3000 | 58000.0 | 71500.0 | HomeImp | Mgr | 10.0 | NaN | 2.0 | 211.933333 | 0.0 | 25.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5735 | 0 | 42900 | 101731.0 | 155688.0 | DebtCon | Other | 14.0 | NaN | 1.0 | 235.352057 | 3.0 | 30.0 | 43.181424 |
| 5763 | 0 | 44400 | 100564.0 | 154708.0 | DebtCon | Other | 14.0 | NaN | 0.0 | 253.914111 | 3.0 | 30.0 | 43.902152 |
| 5766 | 0 | 44900 | 95410.0 | 157649.0 | DebtCon | Other | 16.0 | NaN | 1.0 | 216.101046 | 4.0 | 31.0 | 40.483515 |
| 5820 | 1 | 50000 | 80286.0 | 145000.0 | DebtCon | Other | 12.0 | NaN | 1.0 | 178.766599 | 0.0 | 35.0 | NaN |
| 5882 | 1 | 53800 | 82324.0 | 154409.0 | DebtCon | Other | 10.0 | NaN | 1.0 | 157.484011 | 0.0 | 34.0 | 41.659044 |
708 rows × 13 columns
# Checking total missing values for DEROG
total_nan_derog = nan_derog.shape[0]
total_nan_derog
708
# Checking percentage of missing values in DEROG that default
bad_nan_derog = nan_derog[nan_derog['BAD'] == 1].shape[0] / total_nan_derog * 100
bad_nan_derog
12.288135593220339
Observation:
# Creating a DataFrame to indicate where values are missing
missing_values_df = data.isnull()
# Dropping the columns BAD and LOAN as these have no missing values
missing_values_df = missing_values_df.drop(columns=['BAD', 'LOAN'])
# Computing the correlation matrix for missing values
missing_corr = missing_values_df.corr()
# Creating a heatmap to visualize the correlations
plt.figure(figsize=(12, 8))
sns.heatmap(missing_corr, annot=True, cmap='coolwarm', cbar=True, linewidths=0.5, linecolor='white')
plt.title('Correlation of Missing Values (Excluding BAD and LOAN)')
plt.show()
Observations:
# Checking the relationship between missing values in CLNO and missing values in other columns
nan_clno = data[pd.isna(data['CLNO'])]
nan_clno
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 1 | 2000 | 22608.0 | NaN | NaN | NaN | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | 1 | 2200 | 23030.0 | NaN | NaN | NaN | 19.0 | NaN | NaN | NaN | NaN | NaN | 3.711312 |
| 51 | 0 | 3100 | NaN | 70400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | 1 | 3600 | 61584.0 | 61800.0 | HomeImp | ProfExe | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4680 | 0 | 24600 | NaN | 146804.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 17.263535 |
| 4789 | 0 | 25100 | 85337.0 | 104607.0 | HomeImp | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | 27.950475 |
| 4880 | 0 | 25600 | NaN | 147598.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.461987 |
| 4899 | 0 | 25700 | 85417.0 | 98179.0 | HomeImp | NaN | 7.0 | NaN | NaN | NaN | NaN | NaN | 30.829477 |
| 4947 | 0 | 26100 | NaN | 151429.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.567001 |
222 rows × 13 columns
Observation:
As expected, some customers have missing values across the same set of columns (DEROG, DELINQ, CLAGE, NINQ, CLNO). This pattern differs from the one for DEBTINC, which supports the idea that DEBTINC's missingness carries additional information.
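This co-missingness can be quantified directly; a sketch (the helper `jointly_missing` is ours) counting applicants for whom all five credit-report fields are absent at once:

```python
def jointly_missing(df, cols):
    '''Number of rows where every column in cols is missing simultaneously.'''
    return int(df[cols].isna().all(axis=1).sum())

# In this notebook:
# jointly_missing(data, ['DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO'])
```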
# Creating a variable with columns of object type
cols = data.select_dtypes(['object']).columns.tolist()
# Adding target variable to this list as this is a classification problem and the target variable is categorical
cols.append('BAD')
# Checking the cols list
cols
['REASON', 'JOB', 'BAD']
# Changing the data type of the selected columns to gain memory efficiency
for i in cols:
    data[i] = data[i].astype('category')
# Checking the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   category
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   category
 5   JOB      5681 non-null   category
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: category(3), float64(9), int64(1)
memory usage: 483.7 KB
Observation:
The columns BAD, REASON, and JOB are now of category data type.
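The memory saving comes from category's integer-code representation; a standalone sketch (synthetic series, not the dataset) comparing the two dtypes:

```python
import pandas as pd

# Synthetic low-cardinality column, similar in spirit to REASON
s_obj = pd.Series(['DebtCon', 'HomeImp'] * 1000, dtype='object')
s_cat = s_obj.astype('category')

# deep=True counts the actual string storage for the object column
print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))
# The category version stores each distinct string once plus small integer
# codes, so it should be several times smaller here.
```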
# Checking summary of categorical data
data.describe(include=['category'])
| | BAD | REASON | JOB |
|---|---|---|---|
| count | 5960 | 5708 | 5681 |
| unique | 2 | 2 | 6 |
| top | 0 | DebtCon | Other |
| freq | 4771 | 3928 | 2388 |
Observations:
# Checking the percentage of unique values in categorical columns
# Creating a list with the categorical columns
categ_cols = data.select_dtypes(['category'])
# Displaying percentage of unique values
for i in categ_cols.columns:
    print('Unique values in', i, '(percent):')
    print(data[i].value_counts(normalize=True))
    print('-' * 35)
Unique values in BAD (percent):
0    0.800503
1    0.199497
Name: BAD, dtype: float64
-----------------------------------
Unique values in REASON (percent):
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
-----------------------------------
Unique values in JOB (percent):
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
-----------------------------------
Observations:
Leading Questions:
# Function for generating a boxplot and histogram
def boxplot_histogram(feature, figsize=(15, 10), bins=None, kde=False,
                      hist_color='steelblue', mean_color='green',
                      median_color='black', box_color='tan'):
    '''
    Boxplot and histogram combined.
    feature: 1-d feature array.
    figsize: size of fig (default (15, 10)).
    bins: number of bins (default None / auto).
    kde: bool, whether to plot a gaussian kernel density estimate (default False).
    hist_color: color for histogram bins (default 'steelblue').
    mean_color: color for mean line in histogram (default 'green').
    median_color: color for median line in histogram (default 'black').
    box_color: color for the boxplot (default 'tan').
    '''
    f2, (ax_box, ax_hist) = plt.subplots(nrows=2,  # Number of rows of subplot grid = 2
                                         sharex=True,  # x-axis shared among subplots
                                         gridspec_kw={'height_ratios': (.25, .75)},
                                         figsize=figsize)
    # Boxplot
    sns.boxplot(x=feature, ax=ax_box, showmeans=True, color=box_color)
    # Histogram
    if bins:
        sns.histplot(feature, kde=kde, ax=ax_hist, bins=bins, color=hist_color)
    else:
        sns.histplot(feature, kde=kde, ax=ax_hist, color=hist_color)
    # Mean and median lines for the histogram (NaN-aware, since many columns have missing values)
    mean_value = np.nanmean(feature)
    median_value = np.nanmedian(feature)
    ax_hist.axvline(mean_value, color=mean_color, linestyle='--', label='Mean')
    # Check that the median is valid (not NaN) before plotting it
    if not np.isnan(median_value):
        ax_hist.axvline(median_value, color=median_color, linestyle='-', label='Median')
    # Adding the legend
    ax_hist.legend()
# Calling boxplot_histogram function for LOAN
boxplot_histogram(data['LOAN'])
Observations:
# Calling boxplot_histogram function for MORTDUE
boxplot_histogram(data['MORTDUE'])
Observations:
# Calling boxplot_histogram function for VALUE
boxplot_histogram(data['VALUE'])
Observations:
# Calling boxplot_histogram function for YOJ
boxplot_histogram(data['YOJ'])
Observations:
# Calling boxplot_histogram function for DEROG
boxplot_histogram(data['DEROG'])
Observations:
# Calling boxplot_histogram function for DELINQ
boxplot_histogram(data['DELINQ'])
Observations:
# Calling boxplot_histogram function for CLAGE
boxplot_histogram(data['CLAGE'])
Observations:
# Calling boxplot_histogram function for NINQ
boxplot_histogram(data['NINQ'])
Observation:
# Calling boxplot_histogram function for CLNO
boxplot_histogram(data['CLNO'])
Observations:
# Calling boxplot_histogram function for DEBTINC
boxplot_histogram(data['DEBTINC'])
Observations:
# Function to create barplots with percentage labels for each category
def perc_on_bar(plot, feature):
    '''
    plot: the bar plot axis object from seaborn or matplotlib.
    feature: the categorical feature used to create the bar plot.
    Note: this function won't work if a column is passed in the hue parameter.
    '''
    total = len(feature)  # Total number of data points in the feature
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Calculate the percentage
        x = p.get_x() + p.get_width() / 2  # Center of the bar
        y = p.get_y() + p.get_height()  # Top of the bar
        plot.annotate(percentage, (x, y), ha='center', va='center', size=12, xytext=(0, 5), textcoords='offset points')
    # Set the font size of x-axis labels
    plot.set_xticklabels(plot.get_xticklabels(), fontsize=12)
# Checking the count and percentage of JOB
# Ordering the categories based on frequency for JOB
order = data['JOB'].value_counts().index
# Creating the countplot with the specified order
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='JOB', data=data, palette='tab10', order=order)
# Adding percentage label
perc_on_bar(ax, data['JOB'])
plt.show()
Observations:
# Checking the count and percentage of REASON
# Ordering the categories based on frequency for REASON
order = data['REASON'].value_counts().index
# Creating the countplot with the specified order
plt.figure(figsize=(3, 5))
ax = sns.countplot(x='REASON', data=data, palette='tab10', order=order)
# Adding percentage label
perc_on_bar(ax, data['REASON'])
plt.show()
Observation:
# Checking the count and percentage of BAD
# Ordering the categories based on frequency for BAD
order = data['BAD'].value_counts().index
# Creating the countplot with the specified order
plt.figure(figsize=(3, 5))
ax = sns.countplot(x='BAD', data=data, palette='tab10', order=order)
# Adding percentage label
perc_on_bar(ax, data['BAD'])
plt.show()
Observation:
# Creating a pie chart to visualize the percentage of customer default
print(data.BAD.value_counts())
labels = 'Default', 'Do Not Default'
sizes = [data.BAD[data['BAD']==1].count(), data.BAD[data['BAD']==0].count()]
explode = (0.1, 0)  # only 'explode' the 1st slice (i.e., 'Default')
colors = ['#B0E0E6', '#BC8F8F']
fig1, ax1 = plt.subplots(figsize=(8, 6))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90, colors=colors)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Proportion of Defaults', size=20)
plt.show()
0    4771
1    1189
Name: BAD, dtype: int64
Observation:
# Function for generating histograms and boxplots to visualize the distribution of a predictor variable by target classes
def distribution_plot_wrt_target(data, predictor, target):
    # Extracting the unique values of the target variable
    target_uniq = sorted(data[target].unique())  # Sort to ensure the order [0, 1]
    # Creating a 2x2 subplot
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    # First row: histograms for each target value
    for i in range(2):
        axs[0, i].set_title(f'Histogram of {predictor} for {target} = {target_uniq[i]}')
        sns.histplot(
            data=data[data[target] == target_uniq[i]],
            x=predictor,
            kde=True,
            ax=axs[0, i],
            color='teal' if i == 0 else 'orange',
        )
    # Second row: boxplots
    axs[1, 0].set_title(f'Boxplot of {predictor} by {target}')
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette='Set2', order=[0, 1])
    axs[1, 1].set_title(f'Boxplot (without outliers) of {predictor} by {target}')
    sns.boxplot(
        data=data, x=target, y=predictor, ax=axs[1, 1], showfliers=False, palette='Set2', order=[0, 1]
    )
    plt.tight_layout()
# Calling the distribution_plot_wrt_target function for LOAN
distribution_plot_wrt_target(data, 'LOAN', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for MORTDUE
distribution_plot_wrt_target(data, 'MORTDUE', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for VALUE
distribution_plot_wrt_target(data, 'VALUE', 'BAD')
Observation:
# Calling the distribution_plot_wrt_target function for YOJ
distribution_plot_wrt_target(data, 'YOJ', 'BAD')
Observation:
# Calling the distribution_plot_wrt_target function for DEROG
distribution_plot_wrt_target(data, 'DEROG', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for DELINQ
distribution_plot_wrt_target(data, 'DELINQ', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for CLAGE
distribution_plot_wrt_target(data, 'CLAGE', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for NINQ
distribution_plot_wrt_target(data, 'NINQ', 'BAD')
Observations:
# Calling the distribution_plot_wrt_target function for CLNO
distribution_plot_wrt_target(data, 'CLNO', 'BAD')
Observation:
The CLNO data has a similar distribution among defaulters and non-defaulters. Non-defaulters have a few more outliers.
# Calling the distribution_plot_wrt_target function for DEBTINC
distribution_plot_wrt_target(data, 'DEBTINC', 'BAD')
Observations:
# Checking for the correlation between VALUE and MORTDUE
sns.scatterplot(x=data['VALUE'], y=data['MORTDUE'])
plt.title('Scatter Plot of MORTDUE vs VALUE')
plt.show()
Observations:
# Checking for the correlation between LOAN and VALUE
sns.scatterplot(x=data['LOAN'], y=data['VALUE'])
plt.title('Scatter Plot of LOAN vs VALUE')
plt.show()
Observations:
# Checking for the correlation between CLNO and CLAGE
sns.scatterplot(x=data['CLNO'], y=data['CLAGE'])
plt.title('Scatter Plot of CLNO vs CLAGE')
plt.show()
Observation:
# Checking for the correlation between MORTDUE and CLNO
sns.scatterplot(x=data['MORTDUE'], y=data['CLNO'])
plt.title('Scatter Plot of MORTDUE vs CLNO')
plt.show()
Observation:
There is a weak positive correlation between MORTDUE and CLNO. This indicates that a higher mortgage balance is associated with a higher number of existing credit lines, but the relationship is not strong.
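The claim of a weak positive relationship can be checked numerically with the already-imported pearsonr, after dropping rows where either value is missing; a sketch (the helper name `corr_no_nan` is ours):

```python
from scipy.stats import pearsonr

def corr_no_nan(df, col_a, col_b):
    '''Pearson correlation computed on rows where both columns are present.'''
    pair = df[[col_a, col_b]].dropna()
    return pearsonr(pair[col_a], pair[col_b])

# In this notebook: corr_no_nan(data, 'MORTDUE', 'CLNO')
# An r between roughly 0.1 and 0.3 would be consistent with a weak positive correlation.
```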
# Checking for the correlation between CLNO and VALUE
sns.scatterplot(x=data['CLNO'], y=data['VALUE'])
plt.title('Scatter Plot of CLNO vs VALUE')
plt.show()
Observations:
# Checking for the correlation between DELINQ and LOAN
sns.scatterplot(x=data['DELINQ'], y=data['LOAN'])
plt.title('Scatter Plot of DELINQ vs LOAN')
plt.show()
Observations:
# Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    with sns.axes_style('darkgrid'):
        # Calculate and display the crosstab values
        tab1 = pd.crosstab(x, data['BAD'], margins=True)
        print(tab1)
        print('-' * 120)
        # Define 'tab' as the normalized crosstab
        tab = pd.crosstab(x, data['BAD'], normalize='index')
        # Plot each category as a stacked bar
        ax = tab.plot(kind='bar', stacked=True, figsize=(10, 5), color=['steelblue', 'lightcoral'])
        # Set the legend
        plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
        # Set x-tick labels to be horizontal
        ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
        plt.show()
# Calling stacked_plot function for REASON
stacked_plot(data['REASON'])
BAD          0     1   All
REASON
DebtCon   3183   745  3928
HomeImp   1384   396  1780
All       4567  1141  5708
------------------------------------------------------------------------------------------------------------------------
Observation:
The likelihood of default is slightly higher among loans for home improvement than among loans for debt consolidation.
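The crosstab above makes the sizes precise; recomputing the per-reason default rates from those counts:

```python
# Default rates implied by the crosstab printed above
debtcon_rate = 745 / 3928 * 100   # debt consolidation
homeimp_rate = 396 / 1780 * 100   # home improvement
print(f'DebtCon: {debtcon_rate:.1f}%  HomeImp: {homeimp_rate:.1f}%')
# About 19.0% vs 22.2%: home-improvement loans default slightly more often.
```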
# Calling stacked_plot function for JOB
stacked_plot(data['JOB'])
BAD         0     1   All
JOB
Mgr       588   179   767
Office    823   125   948
Other    1834   554  2388
ProfExe  1064   212  1276
Sales      71    38   109
Self      135    58   193
All      4515  1166  5681
------------------------------------------------------------------------------------------------------------------------
Observations:
# Selecting numerical columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
# Creating heatmap to visualize correlation matrix of the selected columns
plt.figure(figsize=(12, 7))
sns.heatmap(
data[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='RdBu', linewidths=0.5, linecolor='white'
)
plt.show()
Observations:
# Displaying pairplot
sns.pairplot(data, hue='BAD')
plt.show()
Observations:
Considering the outliers in the dataset, the IQR method is a reasonable approach for treatment, preserving dataset integrity while mitigating outlier effects.
# Function for treating outliers with the IQR method
def treat_outliers(df, col):
    '''
    Treats outliers in a variable by clipping to the IQR whiskers.
    df: the dataframe.
    col: str, name of the numerical variable.
    '''
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1  # Interquartile range
    Lower_Whisker = Q1 - 1.5 * IQR  # Lower whisker
    Upper_Whisker = Q3 + 1.5 * IQR  # Upper whisker
    # Values below Lower_Whisker are set to Lower_Whisker;
    # values above Upper_Whisker are set to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
def treat_outliers_all(df, col_list):
    '''
    Treats outliers in all numerical variables.
    df: the dataframe.
    col_list: list of numerical variables.
    '''
    for c in col_list:
        df = treat_outliers(df, c)
    return df
# Copying the data to a different variable to avoid making changes to the original data
df_raw = data.copy()
# Selecting numerical columns
num_cols = df_raw.select_dtypes(include=np.number).columns.tolist()
# Creating a dataframe with treated outliers
df = treat_outliers_all(df_raw, num_cols)
# Checking summary statistics after outlier treatment
df.describe()
| | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 5960.000000 | 5442.000000 | 5848.000000 | 5445.000000 | 5252.0 | 5380.0 | 5652.000000 | 5450.000000 | 5738.000000 | 4693.000000 |
| mean | 18051.895973 | 71566.093752 | 98538.057633 | 8.873159 | 0.0 | 0.0 | 178.635811 | 1.093394 | 21.032851 | 33.681973 |
| std | 9252.565294 | 37203.654400 | 45070.800236 | 7.430914 | 0.0 | 0.0 | 80.495471 | 1.372692 | 9.420239 | 7.135236 |
| min | 1100.000000 | 2063.000000 | 8000.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 14.345367 |
| 25% | 11100.000000 | 46276.000000 | 66075.500000 | 3.000000 | 0.0 | 0.0 | 115.116702 | 0.000000 | 15.000000 | 29.140031 |
| 50% | 16300.000000 | 65019.000000 | 89235.500000 | 7.000000 | 0.0 | 0.0 | 173.466667 | 1.000000 | 20.000000 | 34.818262 |
| 75% | 23300.000000 | 91488.000000 | 119824.250000 | 13.000000 | 0.0 | 0.0 | 231.562278 | 2.000000 | 26.000000 | 39.003141 |
| max | 41600.000000 | 159306.000000 | 200447.375000 | 28.000000 | 0.0 | 0.0 | 406.230642 | 5.000000 | 42.500000 | 53.797805 |
Observation:
Note that the IQR clipping has collapsed DEROG and DELINQ to a constant value of 0, since their 25th and 75th percentiles are both 0; these two columns therefore carry no information in models trained on this treated dataset. Additionally, the dataset has missing values in many columns, which can be treated with the median for numerical variables and the mode for categorical variables. As missingness appears correlated with BAD in some columns, notably DEBTINC, we will create an accompanying binary flag for each column with missing values, allowing the model to capture any such correlations.
# For each column with missing values, create a boolean flag column: True if the value is missing, else False
def add_binary_flag(df, col):
    '''
    df: dataframe
    col: column which has missing values
    Returns the dataframe with an added boolean flag column marking missing values in col
    '''
    new_col = str(col) + '_missing_values_flag'
    df[new_col] = df[col].isna()
    return df
# List of columns that have missing values
missing_col = [col for col in df.columns if df[col].isnull().any()]
# Add flag for each column with missing values
for colmn in missing_col:
    add_binary_flag(df, colmn)
# Selecting numeric columns
num_data = df.select_dtypes('number')
# Selecting categorical columns
cat_data = df.select_dtypes('category').columns.tolist()
# Filling numeric columns with median
df[num_data.columns] = num_data.fillna(num_data.median())
# Filling categorical columns with mode
for column in cat_data:
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)
# Checking the dataframe first 5 rows
df.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | ... | VALUE_missing_values_flag | REASON_missing_values_flag | JOB_missing_values_flag | YOJ_missing_values_flag | DEROG_missing_values_flag | DELINQ_missing_values_flag | CLAGE_missing_values_flag | NINQ_missing_values_flag | CLNO_missing_values_flag | DEBTINC_missing_values_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 0.0 | 121.833333 | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | 1 | 1500 | 65019.0 | 89235.5 | DebtCon | Other | 7.0 | 0.0 | 0.0 | 173.466667 | ... | True | True | True | True | True | True | True | True | True | True |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | ... | False | False | False | False | False | False | False | False | False | True |
5 rows × 24 columns
Observation:
What are the most important observations and insights from the data based on the EDA performed?
Data Description:
Observations from EDA:
Data Cleaning:
# Checking for correlation between VALUE and BAD
# Calculating the Pearson Correlation Coefficient
correlation, p_value = pearsonr(df['VALUE'], df['BAD'])
# Print the correlation
print('Pearson Correlation Coefficient:', correlation)
# Interpret the result
if correlation < -0.5:
    print("There is a strong negative correlation.")
elif correlation > 0.5:
    print("There is a strong positive correlation.")
elif correlation < 0:
    print("There is a weak negative correlation.")
elif correlation > 0:
    print("There is a weak positive correlation.")
else:
    print("There is no significant correlation.")
Pearson Correlation Coefficient: -0.07112411230014708
There is a weak negative correlation.
# Checking for correlation between LOAN and BAD
# Calculate the Pearson Correlation Coefficient
correlation, p_value = pearsonr(df['LOAN'], df['BAD'])
# Print the correlation
print('Pearson Correlation Coefficient:', correlation)
# Interpret the result
if correlation < -0.5:
    print("There is a strong negative correlation.")
elif correlation > 0.5:
    print("There is a strong positive correlation.")
elif correlation < 0:
    print("There is a weak negative correlation.")
elif correlation > 0:
    print("There is a weak positive correlation.")
else:
    print("There is no significant correlation.")
Pearson Correlation Coefficient: -0.08502693424025151
There is a weak negative correlation.
What is the range of values for the loan amount variable "LOAN"?
How does the distribution of years at present job "YOJ" vary across the dataset?
How many unique categories are there in the REASON variable?
What is the most common category in the JOB variable?
Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
Do applicants who default have a significantly different loan amount compared to those who repay their loan?
Is there a correlation between the value of the property and the loan default rate?
Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
Note: these questions have been addressed as part of EDA.
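As a sketch of how such questions can be answered with a few pandas expressions (shown here on a toy stand-in dataframe with illustrative values, not the real HMEQ data; in the notebook the same calls run against `df`):

```python
import pandas as pd

# Toy stand-in for the treated HMEQ dataframe (illustrative values only)
df = pd.DataFrame({
    'BAD':    [1, 0, 0, 1, 0, 0],
    'LOAN':   [1100, 5000, 8000, 2000, 41600, 9000],
    'YOJ':    [10.5, 7.0, 4.0, 7.0, 3.0, 12.0],
    'REASON': ['HomeImp', 'DebtCon', 'HomeImp', 'DebtCon', 'HomeImp', 'DebtCon'],
    'JOB':    ['Other', 'Office', 'Other', 'Mgr', 'Other', 'Office'],
})

loan_range = (df['LOAN'].min(), df['LOAN'].max())       # range of LOAN
yoj_summary = df['YOJ'].describe()                      # distribution of YOJ
n_reasons = df['REASON'].nunique()                      # unique categories in REASON
top_job = df['JOB'].mode()[0]                           # most common JOB category
default_by_reason = df.groupby('REASON')['BAD'].mean()  # default rate per REASON
loan_by_outcome = df.groupby('BAD')['LOAN'].median()    # loan size: defaulters vs repayers

print(loan_range, n_reasons, top_job)
```

The `groupby` aggregations are the workhorse here: comparing `BAD`-grouped medians of LOAN, VALUE, or MORTDUE answers the distributional questions directly.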
Separating the dependent variable (Y) and the independent variables (X)
# Separating the target variable and other variables
Y = df.BAD
X = df.drop(columns=['BAD'], axis=1)
Creating dummy variables for categorical variables
# Creating the list of columns for which we need to create the dummy variables
to_get_dummies_for = ['REASON', 'JOB']
# Creating dummy variables
X = pd.get_dummies(X, columns=to_get_dummies_for, drop_first=True)
Splitting the data into 70% train and 30% test sets
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)
Scaling the data
# Applying StandardScaler to normalize feature scales
sc = StandardScaler()
# Scaling the training data: fit_transform to learn the scaling parameters and apply them to the training data
X_train_scaled = sc.fit_transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
# Scaling the test data: transform using the scaling parameters learned from the training data to prevent data leakage
X_test_scaled = sc.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
The model can make two types of wrong predictions:
Which case is more important?
Predicting that a customer will not default when the customer actually defaults (a false negative), which can have a large negative impact on the bank's earnings. This is the major risk and hence the more important type of wrong prediction.
How to reduce this loss i.e. the need to reduce False Negatives?
The bank would want the Recall to be maximized, the greater the Recall, the higher the chances of minimizing false negatives. Thus, the focus should be on increasing the Recall (minimizing the false negatives) or, in other words, identifying the true positives (i.e. Class 1) very well, so that the bank can avoid losses from defaults.
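To make this concrete, a minimal illustration (with made-up labels, not model output) of how recall on class 1 relates directly to false negatives:

```python
from sklearn.metrics import recall_score, confusion_matrix

# Toy example: 1 = defaulted, 0 = repaid (illustrative labels only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one defaulter missed (a false negative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # recall = tp / (tp + fn)
print(fn, recall)  # fewer false negatives -> higher recall
```

With one of four defaulters missed, recall is 3/4 = 0.75; driving false negatives to zero drives recall to 1.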
Creating a function to calculate and display the classification report and confusion matrix, so that we can apply it to each model without repeatedly rewriting the same code.
# Function for generating classification report and confusion matrix
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Defaulted', 'Defaulted'], yticklabels=['Not Defaulted', 'Defaulted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
We will be building different models:
# Define the logistic regression model
lg = LogisticRegression()
# Fit the logistic regression model
lg.fit(X_train_scaled, y_train)
LogisticRegression()
# Checking performance on the training data
y_pred_train_lg = lg.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_lg)
precision recall f1-score support
0 0.91 0.94 0.92 3340
1 0.71 0.61 0.65 832
accuracy 0.87 4172
macro avg 0.81 0.77 0.79 4172
weighted avg 0.87 0.87 0.87 4172
Observation:
The recall for class 1 on the train data is modest, suggesting there is room for improving the model's performance.
# Checking performance on the test data
y_pred_test_lg = lg.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_lg)
precision recall f1-score support
0 0.90 0.95 0.93 1431
1 0.74 0.59 0.66 357
accuracy 0.88 1788
macro avg 0.82 0.77 0.79 1788
weighted avg 0.87 0.88 0.87 1788
Observations:
Let's examine the coefficients to find which variables drive default and which help prevent it.
# Printing the coefficients of logistic regression
cols = X.columns
coef_lg = lg.coef_
pd.DataFrame(coef_lg, columns=cols).T.sort_values(by=0, ascending=False)
| | 0 |
|---|---|
| DEBTINC_missing_values_flag | 1.161135 |
| VALUE_missing_values_flag | 0.725791 |
| DEBTINC | 0.688505 |
| CLNO_missing_values_flag | 0.324244 |
| CLAGE_missing_values_flag | 0.248795 |
| NINQ | 0.207118 |
| MORTDUE_missing_values_flag | 0.153564 |
| JOB_Sales | 0.114494 |
| REASON_HomeImp | 0.113800 |
| VALUE | 0.097755 |
| CLNO | 0.066866 |
| JOB_Self | 0.059974 |
| NINQ_missing_values_flag | 0.030234 |
| REASON_missing_values_flag | 0.010518 |
| DELINQ | 0.000000 |
| DEROG | 0.000000 |
| JOB_Other | -0.018783 |
| LOAN | -0.072913 |
| JOB_ProfExe | -0.144529 |
| YOJ | -0.159696 |
| MORTDUE | -0.188947 |
| JOB_Office | -0.195640 |
| YOJ_missing_values_flag | -0.234868 |
| DEROG_missing_values_flag | -0.291030 |
| DELINQ_missing_values_flag | -0.407731 |
| JOB_missing_values_flag | -0.441568 |
| CLAGE | -0.461858 |
Observations:
Features that increase the likelihood of default:
Features with no influence on the default rate:
Features that decrease the likelihood of default:
Main insights:
The coefficients of the logistic regression model represent log-odds, which are hard to interpret directly. We can convert the log-odds into odds by exponentiating them.
# Calculating the odds based on logistic regression coefficients
odds = np.exp(lg.coef_[0])
# Adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train_scaled.columns, columns=['odds']).sort_values(by='odds', ascending=False)
| | odds |
|---|---|
| DEBTINC_missing_values_flag | 3.193556 |
| VALUE_missing_values_flag | 2.066365 |
| DEBTINC | 1.990736 |
| CLNO_missing_values_flag | 1.382984 |
| CLAGE_missing_values_flag | 1.282479 |
| NINQ | 1.230128 |
| MORTDUE_missing_values_flag | 1.165982 |
| JOB_Sales | 1.121306 |
| REASON_HomeImp | 1.120528 |
| VALUE | 1.102693 |
| CLNO | 1.069152 |
| JOB_Self | 1.061809 |
| NINQ_missing_values_flag | 1.030696 |
| REASON_missing_values_flag | 1.010574 |
| DELINQ | 1.000000 |
| DEROG | 1.000000 |
| JOB_Other | 0.981392 |
| LOAN | 0.929682 |
| JOB_ProfExe | 0.865430 |
| YOJ | 0.852403 |
| MORTDUE | 0.827831 |
| JOB_Office | 0.822308 |
| YOJ_missing_values_flag | 0.790675 |
| DEROG_missing_values_flag | 0.747493 |
| DELINQ_missing_values_flag | 0.665158 |
| JOB_missing_values_flag | 0.643027 |
| CLAGE | 0.630112 |
Observations:
# Creating a DataFrame for the scaled features
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
# Calculating VIF for each feature
vif = pd.DataFrame()
vif['Feature'] = X_train_scaled_df.columns
vif['VIF'] = [variance_inflation_factor(X_train_scaled_df.values, i) for i in range(X_train_scaled_df.shape[1])]
# Displaying the VIF values
print(vif)
                        Feature       VIF
0                          LOAN  1.341691
1                       MORTDUE  4.026377
2                         VALUE  4.394969
3                           YOJ  1.096599
4                         DEROG       NaN
5                        DELINQ       NaN
6                         CLAGE  1.192714
7                          NINQ  1.133478
8                          CLNO  1.405307
9                       DEBTINC  1.193907
10  MORTDUE_missing_values_flag  1.534863
11    VALUE_missing_values_flag  1.047844
12   REASON_missing_values_flag  1.329240
13      JOB_missing_values_flag  1.517599
14      YOJ_missing_values_flag  1.315230
15    DEROG_missing_values_flag  2.564998
16   DELINQ_missing_values_flag  3.587919
17    CLAGE_missing_values_flag  3.657820
18     NINQ_missing_values_flag  2.799331
19     CLNO_missing_values_flag  4.535180
20  DEBTINC_missing_values_flag  1.098773
21               REASON_HomeImp  1.175721
22                   JOB_Office  1.946802
23                    JOB_Other  2.745345
24                  JOB_ProfExe  2.225169
25                    JOB_Sales  1.162261
26                     JOB_Self  1.311262
Observation:
The VIF values for MORTDUE and VALUE are the highest, as expected from the EDA, but both remain below 5, a commonly used threshold. We will therefore keep both features, as multicollinearity is not severe. The NaN values for DEROG and DELINQ arise because those columns became constant after outlier clipping.
# Selecting features with coefficients equal or greater than 0.15
# Extract coefficients
coefficients = lg.coef_[0]
features = X_train.columns # Feature names from the original dataframe
# Select threshold and apply it to feature selection
threshold = 0.15
selected_features = features[np.abs(coefficients) >= threshold]
# Print selected features
print('Selected Features based on threshold 0.15:')
print(selected_features)
Selected Features based on threshold 0.15:
Index(['MORTDUE', 'YOJ', 'CLAGE', 'NINQ', 'DEBTINC',
'MORTDUE_missing_values_flag', 'VALUE_missing_values_flag',
'JOB_missing_values_flag', 'YOJ_missing_values_flag',
'DEROG_missing_values_flag', 'DELINQ_missing_values_flag',
'CLAGE_missing_values_flag', 'CLNO_missing_values_flag',
'DEBTINC_missing_values_flag', 'JOB_Office'],
dtype='object')
# Scaling the selected data
scaler = StandardScaler()
X_train_selected_scaled = scaler.fit_transform(X_train[selected_features])
X_test_selected_scaled = scaler.transform(X_test[selected_features])
# Fit logistic regression with selected features
lg_selected = LogisticRegression()
lg_selected.fit(X_train_selected_scaled, y_train)
LogisticRegression()
# Checking performance on the training data
y_pred_train_selected = lg_selected.predict(X_train_selected_scaled)
metrics_score(y_train, y_pred_train_selected)
precision recall f1-score support
0 0.91 0.94 0.92 3340
1 0.71 0.61 0.65 832
accuracy 0.87 4172
macro avg 0.81 0.77 0.79 4172
weighted avg 0.87 0.87 0.87 4172
Observation:
We have simplified the model while maintaining the same performance on the train data.
# Checking performance on the test data
y_pred_test_selected = lg_selected.predict(X_test_selected_scaled)
metrics_score(y_test, y_pred_test_selected)
precision recall f1-score support
0 0.90 0.95 0.92 1431
1 0.74 0.59 0.66 357
accuracy 0.88 1788
macro avg 0.82 0.77 0.79 1788
weighted avg 0.87 0.88 0.87 1788
Observation:
We have simplified the model while maintaining the same performance on the test data. Let's see if we can improve the performance with feature engineering.
# Feature engineering using polynomials and interactions
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_train_selected_poly = poly.fit_transform(X_train_selected_scaled)
X_test_selected_poly = poly.transform(X_test_selected_scaled)
# Fit logistic regression with polynomial features
lg_selected_poly = LogisticRegression()
lg_selected_poly.fit(X_train_selected_poly, y_train)
LogisticRegression()
# Checking performance on the training data
y_pred_train_selected_poly = lg_selected_poly.predict(X_train_selected_poly)
metrics_score(y_train, y_pred_train_selected_poly)
precision recall f1-score support
0 0.92 0.95 0.93 3340
1 0.76 0.68 0.72 832
accuracy 0.89 4172
macro avg 0.84 0.81 0.83 4172
weighted avg 0.89 0.89 0.89 4172
Observation:
Performance has improved on the train data compared to the previous model, including the recall on class 1.
# Checking performance on the test data
y_pred_test_selected_poly = lg_selected_poly.predict(X_test_selected_poly)
metrics_score(y_test, y_pred_test_selected_poly)
precision recall f1-score support
0 0.91 0.95 0.93 1431
1 0.77 0.64 0.70 357
accuracy 0.89 1788
macro avg 0.84 0.80 0.82 1788
weighted avg 0.89 0.89 0.89 1788
Observations:
# Plotting the Precision-Recall Curve to adjust classification threshold
# predict_proba gives the probability of each observation belonging to each class
y_scores_lg_selected_poly = lg_selected_poly.predict_proba(X_train_selected_poly)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg_selected_poly[:, 1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize = (10, 7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label='recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0, 1])
plt.title ('Precision-Recall Curve')
plt.show()
Observation:
The precision and the recall are balanced at a threshold of about 0.4. Given that the default 0.5 threshold favors precision, and that we want to give an edge to recall, let's try a threshold of 0.35 and check the model's performance at that threshold.
# Setting a custom classification threshold
optimal_threshold = 0.35
# Note: scikit-learn's LogisticRegression does not accept a decision threshold;
# the custom cutoff is applied below when converting predicted probabilities into labels
lg_selected_poly_optimized = LogisticRegression()
# Fitting the model on the train data
lg_selected_poly_optimized.fit(X_train_selected_poly, y_train)
LogisticRegression()
# Checking performance on the train data
y_pred_train_lg_selected_poly_optimized = lg_selected_poly_optimized.predict_proba(X_train_selected_poly)
metrics_score(y_train, y_pred_train_lg_selected_poly_optimized[:, 1] > optimal_threshold)
precision recall f1-score support
0 0.94 0.92 0.93 3340
1 0.70 0.77 0.73 832
accuracy 0.89 4172
macro avg 0.82 0.84 0.83 4172
weighted avg 0.89 0.89 0.89 4172
Observation:
The recall for class 1 on the train data has improved significantly while maintaining robust overall performance compared to the previous model.
# Checking performance on the test data
y_pred_test_lg_selected_poly_optimized = lg_selected_poly_optimized.predict_proba(X_test_selected_poly)
metrics_score(y_test, y_pred_test_lg_selected_poly_optimized[:, 1] > optimal_threshold)
precision recall f1-score support
0 0.93 0.92 0.93 1431
1 0.70 0.73 0.71 357
accuracy 0.88 1788
macro avg 0.82 0.82 0.82 1788
weighted avg 0.89 0.88 0.88 1788
Observations:
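Since scikit-learn's `LogisticRegression` has no built-in threshold parameter, the 0.35 cutoff is applied manually to the predicted probabilities. One way to keep that logic tidy is a small helper (the name `predict_with_threshold` is ours, not a library function), sketched here on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_with_threshold(model, X, threshold=0.5):
    """Return class labels using a custom probability cutoff for class 1."""
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int)

# Tiny illustrative fit; lowering the threshold can only add positive predictions
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

pred_default = predict_with_threshold(model, X_demo, 0.5)
pred_lenient = predict_with_threshold(model, X_demo, 0.35)
print(pred_default, pred_lenient)
```

Lowering the cutoff trades precision for recall: every observation flagged at 0.5 is still flagged at 0.35, and some borderline cases are added.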
Note: for logistic regression, we treated outliers before building the model. Decision trees, however, do not require outlier treatment: they split data on feature-value thresholds, so only the ordering of values matters, not their magnitude, which makes them robust to outliers by design.
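As a small illustration of this robustness (toy data, not the HMEQ features), clipping an extreme value does not change the split a tree learns, because the split depends only on the ordering of values:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature with an extreme outlier in the last row
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [1e6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the same depth-1 tree on raw vs clipped data
tree_raw = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X, y)
X_clipped = np.clip(X, None, 20.0)
tree_clip = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X_clipped, y)

# Both trees split at the same point and make identical predictions
print(tree_raw.tree_.threshold[0], tree_clip.tree_.threshold[0])
```

Both trees split between 3 and 10, so clipping the outlier at 20 (or leaving it at one million) changes nothing, whereas a distance- or mean-based model would be pulled heavily by the extreme value.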
Adding binary flags for columns that have missing values
# List of columns that have missing values
missing_col = [col for col in data.columns if data[col].isnull().any()]
for colmn in missing_col:
    add_binary_flag(data, colmn)
Treating missing values in numerical columns with median and mode in categorical variables
# Selecting numeric columns
num_data = data.select_dtypes('number')
# Selecting categorical columns
cat_data = data.select_dtypes('category').columns.tolist()
# Filling numeric columns with median
data[num_data.columns] = num_data.fillna(num_data.median())
# Filling categorical columns with mode
for column in cat_data:
    mode = data[column].mode()[0]
    data[column] = data[column].fillna(mode)
Separating the target variable y and independent variable x
# Separating the target variable and other variables
Y = data.BAD
X = data.drop(columns=['BAD'], axis=1)
# List of columns for which we need to create dummy variables
to_get_dummies_for = ['REASON', 'JOB']
# Creating dummy variables
X = pd.get_dummies(X, columns=to_get_dummies_for, drop_first=True)
Splitting the data into 70% train and 30% test sets
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)
# Define the decision tree model with adjusted class weights so that class imbalance does not bias the model
dt = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Fit the decision tree model
dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Checking performance on the training data
y_pred_train_dt = dt.predict(X_train)
metrics_score(y_train, y_pred_train_dt)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation:
The model scores 100% on every metric on the training data. This strongly suggests the decision tree is overfitting the training data, as we have not yet applied pruning or other techniques to prevent overfitting.
# Checking performance on the test data
y_pred_test_dt = dt.predict(X_test)
metrics_score(y_test, y_pred_test_dt)
precision recall f1-score support
0 0.91 0.94 0.93 1431
1 0.74 0.62 0.68 357
accuracy 0.88 1788
macro avg 0.82 0.78 0.80 1788
weighted avg 0.88 0.88 0.88 1788
Observations:
# Choosing the type of classifier
dt_tuned = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(2, 15),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [15, 25],
    'class_weight': ['balanced', {0: 0.2, 1: 0.8}]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label=1)
# Running the grid search
grid_obj = GridSearchCV(dt_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Setting the classifier to the best combination of parameters
dt_tuned = grid_obj.best_estimator_
# Fitting the optimized algorithm to the data
dt_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy', max_depth=11,
                       min_samples_leaf=25, random_state=1)
# Checking performance on the training data
y_pred_train_dt_tuned = dt_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_dt_tuned)
precision recall f1-score support
0 0.97 0.85 0.91 3340
1 0.60 0.89 0.72 832
accuracy 0.86 4172
macro avg 0.78 0.87 0.81 4172
weighted avg 0.90 0.86 0.87 4172
Observation:
Compared to the default model, the training scores are no longer perfect while recall on class 1 remains high (0.89), suggesting reduced overfitting.
# Checking performance on the test data
y_pred_test_dt_tuned = dt_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_dt_tuned)
precision recall f1-score support
0 0.95 0.84 0.89 1431
1 0.57 0.82 0.67 357
accuracy 0.84 1788
macro avg 0.76 0.83 0.78 1788
weighted avg 0.87 0.84 0.85 1788
Observations:
# Plotting the tuned decision tree
features = list(X.columns)
plt.figure(figsize = (25, 25))
tree.plot_tree(dt_tuned, feature_names=features, filled=True, fontsize=9, node_ids=True, class_names=True, max_depth=3)
plt.show()
Observations:
Left branch:
Right branch:
# Displaying the importance of features to gain a better understanding of the model
print (pd.DataFrame(dt_tuned.feature_importances_, columns = ['Imp'], index=X_train.columns).sort_values(by='Imp', ascending=False))
                                      Imp
DEBTINC_missing_values_flag  4.398423e-01
DEBTINC                      1.630897e-01
DELINQ                       8.407907e-02
CLAGE                        6.640726e-02
MORTDUE                      5.453941e-02
LOAN                         4.529489e-02
YOJ                          3.532471e-02
CLNO                         2.832708e-02
VALUE                        2.630174e-02
DEROG                        2.002319e-02
JOB_missing_values_flag      1.171844e-02
JOB_Other                    8.564099e-03
DEROG_missing_values_flag    6.012113e-03
NINQ                         4.551285e-03
REASON_HomeImp               3.541756e-03
JOB_ProfExe                  2.382995e-03
MORTDUE_missing_values_flag  7.311594e-17
VALUE_missing_values_flag    0.000000e+00
REASON_missing_values_flag   0.000000e+00
YOJ_missing_values_flag      0.000000e+00
DELINQ_missing_values_flag   0.000000e+00
CLAGE_missing_values_flag    0.000000e+00
NINQ_missing_values_flag     0.000000e+00
CLNO_missing_values_flag     0.000000e+00
JOB_Office                   0.000000e+00
JOB_Sales                    0.000000e+00
JOB_Self                     0.000000e+00
# Extracting feature names
features = X_train.columns
# Plotting the feature importance
importances = dt_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
# Selecting features in the top 40th percentile of importance
threshold = np.percentile(importances, 40)
top_indices = np.where(importances >= threshold)[0] # Get indices of features above the threshold
selected_features = [features[i] for i in top_indices] # Map indices to feature names
# Create the datasets with selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
# Fit the decision tree with selected features
dt_tuned_selected = DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
max_depth=11, min_samples_leaf=25, random_state=1)
dt_tuned_selected.fit(X_train_selected, y_train)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy', max_depth=11,
                       min_samples_leaf=25, random_state=1)
# Checking performance on the training data
y_pred_train_dt_tuned_selected = dt_tuned_selected.predict(X_train_selected)
metrics_score(y_train, y_pred_train_dt_tuned_selected)
precision recall f1-score support
0 0.97 0.85 0.91 3340
1 0.60 0.89 0.72 832
accuracy 0.86 4172
macro avg 0.78 0.87 0.81 4172
weighted avg 0.90 0.86 0.87 4172
Observation:
We have simplified the model while maintaining the same performance on the train data.
# Checking performance on the test data
y_pred_test_dt_tuned_selected = dt_tuned_selected.predict(X_test_selected)
metrics_score(y_test, y_pred_test_dt_tuned_selected)
precision recall f1-score support
0 0.95 0.84 0.89 1431
1 0.57 0.82 0.67 357
accuracy 0.84 1788
macro avg 0.76 0.83 0.78 1788
weighted avg 0.87 0.84 0.85 1788
Observation:
We have simplified the model while maintaining the same performance on the test data. This suggests that this model can be used without any loss in performance.
# Displaying the importance of features to gain a better understanding of the model
print (pd.DataFrame(dt_tuned_selected.feature_importances_, columns = ['Imp'], index=X_train_selected.columns).sort_values(by='Imp', ascending=False))
                                  Imp
DEBTINC_missing_values_flag  0.439842
DEBTINC                      0.163090
DELINQ                       0.084079
CLAGE                        0.066407
MORTDUE                      0.054539
LOAN                         0.045295
YOJ                          0.035325
CLNO                         0.028327
VALUE                        0.026302
DEROG                        0.020023
JOB_missing_values_flag      0.011718
JOB_Other                    0.008564
DEROG_missing_values_flag    0.006012
NINQ                         0.004551
REASON_HomeImp               0.003542
JOB_ProfExe                  0.002383
# Visualizing the feature importance
# Extracting feature names from the selected features
features = X_train_selected.columns
# Plotting the feature importance
importances = dt_tuned_selected.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observation:
So far, this is a promising model, with strong predictive power, a simple structure, and relatively strong performance. Let's see if a random forest model can provide improved performance or further insights.
# Define the random forest classifier
rf_estimator = RandomForestClassifier(random_state=1)
# Fit the random forest classifier
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation:
The model is likely overfitting the training data, with perfect performance of 100%.
# Checking performance on the test data
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.92 0.97 0.95 1431
1 0.84 0.68 0.75 357
accuracy 0.91 1788
macro avg 0.88 0.82 0.85 1788
weighted avg 0.91 0.91 0.91 1788
Observations:
# Define the random forest classifier with weighted classes
rf_estimator_weighted = RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Fit the random forest classifier with weighted classes
rf_estimator_weighted.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Checking performance on the training data
y_pred_train_rf_weighted = rf_estimator_weighted.predict(X_train)
metrics_score(y_train, y_pred_train_rf_weighted)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation:
The model is likely overfitting the training data, with perfect performance of 100%.
# Checking performance on the test data
y_pred_test_rf_weighted = rf_estimator_weighted.predict(X_test)
metrics_score(y_test, y_pred_test_rf_weighted)
precision recall f1-score support
0 0.92 0.97 0.94 1431
1 0.84 0.64 0.73 357
accuracy 0.90 1788
macro avg 0.88 0.81 0.83 1788
weighted avg 0.90 0.90 0.90 1788
Observations:
# Choosing the type of classifier
rf_estimator_tuned = RandomForestClassifier(n_jobs=-1, random_state=1)
# Grid of parameters to choose from
parameters_rf = {
    'n_estimators': [140, 150],
    'max_depth': [8, 9],
    'min_samples_split': [80, 90, 100],
    'class_weight': [None, 'balanced', {0: 0.2, 1: 0.8}],
    'criterion': ['gini', 'entropy']
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score, pos_label=1)
# Defining the random search on the training data using scorer=scorer and cv=5
random_search = RandomizedSearchCV(
    rf_estimator_tuned,
    parameters_rf,
    n_iter=10,  # Number of parameter settings sampled
    scoring=scorer,
    cv=5,
    random_state=1
)
# Running the random search on the training data
random_search.fit(X_train, y_train)
# Saving the best estimator from the random search to rf_estimator_tuned
rf_estimator_tuned = random_search.best_estimator_
# Fitting the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, max_depth=9, min_samples_split=80,
                       n_estimators=150, n_jobs=-1, random_state=1)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 0.96 0.89 0.92 3340
1 0.66 0.84 0.74 832
accuracy 0.88 4172
macro avg 0.81 0.87 0.83 4172
weighted avg 0.90 0.88 0.89 4172
Observation:
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.94 0.89 0.92 1431
1 0.65 0.79 0.71 357
accuracy 0.87 1788
macro avg 0.80 0.84 0.81 1788
weighted avg 0.89 0.87 0.88 1788
Observations:
The tuned model generalizes well: recall for class 1 reaches 0.79 on the test data with only a small gap from the training score, although precision for class 1 drops to 0.65.
# Importance of features in the random forest
print(pd.DataFrame(rf_estimator_tuned.feature_importances_, columns=['Imp'],
                   index=X_train.columns).sort_values(by='Imp', ascending=False))
                                  Imp
DEBTINC_missing_values_flag  0.330461
DEBTINC                      0.240336
DELINQ                       0.102851
CLAGE                        0.062426
DEROG                        0.061291
VALUE_missing_values_flag    0.029020
LOAN                         0.028893
VALUE                        0.028506
CLNO                         0.023327
MORTDUE                      0.021476
NINQ                         0.018746
YOJ                          0.015548
DEROG_missing_values_flag    0.006304
JOB_Office                   0.004171
JOB_Sales                    0.003805
CLAGE_missing_values_flag    0.003501
JOB_missing_values_flag      0.002926
YOJ_missing_values_flag      0.002762
DELINQ_missing_values_flag   0.002521
REASON_HomeImp               0.002208
JOB_Other                    0.001561
CLNO_missing_values_flag     0.001465
JOB_ProfExe                  0.001418
MORTDUE_missing_values_flag  0.001220
JOB_Self                     0.001136
REASON_missing_values_flag   0.001133
NINQ_missing_values_flag     0.000988
# Extract feature importance
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
# Plot feature importance
feature_names = list(X.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
DEBTINC_missing_values_flag and DEBTINC dominate, together accounting for roughly 57% of the total importance, followed by DELINQ, CLAGE, and DEROG; most of the remaining missing-value flags and job dummies contribute very little.
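Impurity-based importances like those plotted above can be biased toward features with many split points; a common cross-check is permutation importance computed on held-out data. The sketch below uses a synthetic dataset as a stand-in for the notebook's objects (the data and model here are illustrative, not the fitted rf_estimator_tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HMEQ features
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Mean drop in test accuracy when each feature is shuffled, over 10 repeats
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=1)
print(result.importances_mean)
```

Features whose permutation barely changes the held-out score are candidates for removal even if their impurity-based importance looks non-trivial.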
Since the data prepared for logistic regression is already scaled, it can be reused to build the KNN model, which is distance-based and therefore sensitive to feature scale.
# Selecting the optimal value of k through error rate analysis
# Initialize a dictionary to store the average errors for different values of k
knn_errors = {}
for k in range(1, 15):  # Looping over values of k from 1 to 14
    train_errors = []
    validation_errors = []
    # Creating a KNN model with the current value of k
    knn = KNeighborsClassifier(n_neighbors=k)
    for i in range(30):  # Repeating the process 30 times for each k
        # Splitting the training data further for validation
        x_train_new, x_val, y_train_new, y_val = train_test_split(X_train_scaled, y_train, test_size=0.20, random_state=i)
        # Fitting KNN on the newly created training data
        knn.fit(x_train_new, y_train_new)
        # Calculating errors on training data and validation data
        train_errors.append(1 - knn.score(x_train_new, y_train_new))
        validation_errors.append(1 - knn.score(x_val, y_val))
    # Calculating the average error for each k value
    knn_errors[k] = [np.mean(train_errors), np.mean(validation_errors)]
# Displaying the errors for different values of k
knn_errors
{1: [0.0, 0.10726546906187623],
2: [0.0731894915592848, 0.13285429141716565],
3: [0.06431924882629107, 0.1168063872255489],
4: [0.09819198881230645, 0.13089820359281437],
5: [0.08934172410348618, 0.1219560878243513],
6: [0.10755169313754867, 0.1322554890219561],
7: [0.10174807711517331, 0.1252694610778443],
8: [0.1131755069423634, 0.13097804391217566],
9: [0.10862051743082611, 0.12682634730538922],
10: [0.11666167216062331, 0.13137724550898203],
11: [0.11368494655878533, 0.1299001996007984],
12: [0.11829987014284288, 0.13269461077844313],
13: [0.11555289181899911, 0.13173652694610777],
14: [0.12050744181400459, 0.1346506986027944]}
# Plotting k versus error
kltest = [] # List for storing k values for the validation curve
vltest = [] # List for storing validation error rates
# Extracting k values and corresponding validation errors from knn_errors
for k, v in knn_errors.items():
    kltest.append(k)  # Append k value
    vltest.append(v[1])  # Append validation error
kltrain = [] # List for storing k values for the train set
vltrain = [] # List for storing error rates for the train set
# Extracting k values and corresponding training errors from knn_errors
for k, v in knn_errors.items():
    kltrain.append(k)  # Append k value
    vltrain.append(v[0])  # Append training error
# Plotting K vs Error
plt.figure(figsize=(10, 6))
plt.plot(kltest, vltest, label='Validation Error')
plt.plot(kltrain, vltrain, label='Training Error')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Error Rate')
plt.title('K vs Error Rate')
plt.legend()
plt.show()
Observations:
k = 1 gives the lowest validation error but a training error of zero, a telltale sign of overfitting. Among the larger values, k = 3 and k = 5 produce the lowest validation errors, so k = 5 is a reasonable choice that yields a smoother decision boundary.
# Define the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the KNN model
knn.fit(X_train_scaled, y_train)
KNeighborsClassifier()
# Checking performance on the training data
y_pred_train_knn = knn.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_knn)
precision recall f1-score support
0 0.92 0.98 0.95 3340
1 0.87 0.67 0.76 832
accuracy 0.91 4172
macro avg 0.90 0.82 0.85 4172
weighted avg 0.91 0.91 0.91 4172
Observation:
The recall for class 1 on the train data is low, although overall performance is strong.
# Checking performance on the test data
y_pred_test_knn = knn.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_knn)
precision recall f1-score support
0 0.90 0.96 0.93 1431
1 0.77 0.58 0.66 357
accuracy 0.88 1788
macro avg 0.83 0.77 0.79 1788
weighted avg 0.87 0.88 0.87 1788
Observation:
The recall for class 1 on the test data is notably weak (0.58). Let's fine-tune this model to see if we can improve the recall.
# Pipeline with SMOTE and KNN
pipeline = Pipeline([
('smote', SMOTE(random_state=1)),
('knn', KNeighborsClassifier())
])
# Define the parameter grid for KNN
params_knn = {
'knn__n_neighbors': np.arange(3, 15),
'knn__weights': ['uniform', 'distance'],
'knn__p': [1, 2]
}
# Custom scorer for recall class 1
recall_scorer = make_scorer(recall_score, pos_label=1)
# GridSearchCV for tuning
grid_knn = GridSearchCV(
pipeline,
params_knn,
scoring=recall_scorer,
cv=10,
verbose=2,
n_jobs=-1
)
# Fit grid_knn to the data
grid_knn.fit(X_train_scaled, y_train)
# Best estimator after grid search
knn_tuned = grid_knn.best_estimator_
# Fitting the best model to the training data
knn_tuned.fit(X_train_scaled, y_train)
Fitting 10 folds for each of 48 candidates, totalling 480 fits
Pipeline(steps=[('smote', SMOTE(random_state=1)),
('knn',
KNeighborsClassifier(n_neighbors=10, p=1,
weights='distance'))])
# Checking performance on the training data
y_pred_train_knn_best = knn_tuned.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_knn_best)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
The model appears to be overfitting the training data, with perfect scores across the board. This is expected with weights='distance': when predicting on the training set, each point is its own nearest neighbor at distance zero and receives infinite weight, so the training labels are reproduced exactly. The test performance is the meaningful benchmark.
# Checking performance on the test data
y_pred_test_knn_best = knn_tuned.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_knn_best)
precision recall f1-score support
0 0.95 0.94 0.95 1431
1 0.78 0.81 0.80 357
accuracy 0.92 1788
macro avg 0.87 0.88 0.87 1788
weighted avg 0.92 0.92 0.92 1788
Observations:
SMOTE combined with the tuned hyperparameters pays off on the test data: recall for class 1 improves to 0.81 with 92% accuracy, so despite the perfect training scores the model generalizes well.
# Fitting the SHAP explainer
explainer = sh.Explainer(knn_tuned.predict, X_test_scaled)
# Calculating the SHAP values
shap_values = explainer(X_test_scaled)
PermutationExplainer explainer: 1789it [1:12:19, 2.43s/it]
# Plotting the SHAP values
sh.plots.bar(shap_values)
Observations:
LDA is less sensitive to the scale of the data because it focuses on maximizing the separability between different classes based on variance, rather than on the absolute values of the features. We will use the scaled variables for consistency in this section.
# Define the LDA model
lda = LinearDiscriminantAnalysis()
# Fit the LDA model
lda.fit(X_train_scaled, y_train)
LinearDiscriminantAnalysis()
# Checking performance on the training data
y_pred_train_lda = lda.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_lda)
precision recall f1-score support
0 0.92 0.91 0.91 3340
1 0.64 0.67 0.66 832
accuracy 0.86 4172
macro avg 0.78 0.79 0.78 4172
weighted avg 0.86 0.86 0.86 4172
Observation:
The model shows modest performance on the training data, with a recall of only 0.67 for class 1.
# Checking performance on the test data
y_pred_test_lda = lda.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_lda)
precision recall f1-score support
0 0.91 0.91 0.91 1431
1 0.65 0.66 0.65 357
accuracy 0.86 1788
macro avg 0.78 0.79 0.78 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
Although accuracy is about 86%, the model struggles to adequately identify the minority class, with a recall of just 0.66 for class 1.
# Creating list of column names
cols = X_train_scaled.columns
# Saving coefficients (discriminant loadings) of LDA model
coef_lda = lda.coef_
# Printing the coefficients
pd.DataFrame(coef_lda, columns=cols).T.sort_values(by=0, ascending=False)
| | 0 |
|---|---|
| DEBTINC_missing_values_flag | 1.889087e+00 |
| VALUE_missing_values_flag | 6.782407e-01 |
| DEBTINC | 4.357623e-01 |
| NINQ | 2.455292e-01 |
| CLAGE_missing_values_flag | 2.449853e-01 |
| CLNO_missing_values_flag | 1.985319e-01 |
| MORTDUE_missing_values_flag | 1.810335e-01 |
| REASON_HomeImp | 1.639822e-01 |
| JOB_Sales | 1.451587e-01 |
| CLNO | 7.817623e-02 |
| NINQ_missing_values_flag | 6.714422e-02 |
| VALUE | 6.635380e-02 |
| JOB_Self | 5.636966e-02 |
| REASON_missing_values_flag | 5.065427e-02 |
| DELINQ | 3.205626e-16 |
| DEROG | 7.749425e-17 |
| JOB_Other | -6.563267e-03 |
| LOAN | -7.920144e-02 |
| JOB_ProfExe | -1.369012e-01 |
| YOJ | -1.585712e-01 |
| MORTDUE | -1.797254e-01 |
| JOB_Office | -1.808192e-01 |
| YOJ_missing_values_flag | -2.150189e-01 |
| DEROG_missing_values_flag | -2.432755e-01 |
| DELINQ_missing_values_flag | -2.608143e-01 |
| JOB_missing_values_flag | -3.385237e-01 |
| CLAGE | -4.292882e-01 |
Observations:
The DEBTINC missing-value flag carries by far the largest positive loading, followed by the VALUE flag and DEBTINC itself, echoing the random forest importances. DELINQ and DEROG have coefficients that are effectively zero (on the order of 1e-16), possibly because their signal is absorbed by correlated features.
# Getting probabilities for each class
y_scores_lda = lda.predict_proba(X_train_scaled)
# Calculating precision and recall for various thresholds
precisions_lda, recalls_lda, thresholds_lda = precision_recall_curve(y_train, y_scores_lda[:, 1])
# Plotting precision and recall as functions of the threshold
plt.figure(figsize=(10, 7))
plt.plot(thresholds_lda, precisions_lda[:-1], 'b--', label='Precision')
plt.plot(thresholds_lda, recalls_lda[:-1], 'g--', label='Recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0, 1])
plt.show()
Observation:
Precision and recall are balanced at a threshold of about 0.55. However, since precision stays relatively high even at lower thresholds, we can experiment with a cutoff of 0.2 to favor recall.
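The crossover eyeballed from the plot can also be located programmatically as the threshold minimizing |precision − recall|. A self-contained sketch on synthetic, imbalanced data (the value found here will differ from the notebook's ~0.55):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import precision_recall_curve

# Imbalanced synthetic stand-in (roughly the 80/20 split of the HMEQ target)
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.8, 0.2],
                                     random_state=1)
scores = LinearDiscriminantAnalysis().fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_demo, scores)
# Index of the threshold where precision and recall are closest together
balanced_idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
print(float(thresholds[balanced_idx]))
```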
# Setting a threshold for classification
optimal_threshold_lda = 0.2
# LinearDiscriminantAnalysis has no built-in threshold parameter, so we store
# the chosen cutoff as a custom attribute and apply it manually after predict_proba
lda_optimized = LinearDiscriminantAnalysis()
lda_optimized.threshold = optimal_threshold_lda
# Fitting the LDA model on the training data
lda_optimized.fit(X_train_scaled, y_train)
LinearDiscriminantAnalysis()
# Predict probabilities on the training data
y_scores_train_lda = lda_optimized.predict_proba(X_train_scaled)
# Apply the threshold to get binary predictions
y_pred_train_binary_lda = (y_scores_train_lda[:, 1] > lda_optimized.threshold).astype(int)
# Checking performance on the training data
metrics_score(y_train, y_pred_train_binary_lda)
precision recall f1-score support
0 0.92 0.90 0.91 3340
1 0.63 0.71 0.67 832
accuracy 0.86 4172
macro avg 0.78 0.80 0.79 4172
weighted avg 0.87 0.86 0.86 4172
Observation:
The recall for class 1 on the training data has improved from 0.67 to 0.71 over the default LDA model, while accuracy holds at 86%.
# Predict probabilities on the test data
y_scores_test_lda = lda_optimized.predict_proba(X_test_scaled)
# Apply the optimal threshold to get binary predictions
y_pred_test_binary_lda = (y_scores_test_lda[:, 1] > lda_optimized.threshold).astype(int)
# Evaluate performance on the test data
metrics_score(y_test, y_pred_test_binary_lda)
precision recall f1-score support
0 0.92 0.90 0.91 1431
1 0.64 0.70 0.67 357
accuracy 0.86 1788
macro avg 0.78 0.80 0.79 1788
weighted avg 0.87 0.86 0.86 1788
Observations:
The test results mirror the training results: recall for class 1 rises to 0.70 at an unchanged 86% accuracy, confirming that lowering the threshold trades a little precision for a useful gain in recall.
Quadratic Discriminant Analysis (QDA) is similar to LDA in that it is less sensitive to the scale of features compared to distance-based algorithms like KNN or SVM. We will use scaled features for consistency. Given that QDA can be particularly sensitive to irrelevant features, we will use the feature selected X matrices.
# Define the QDA model
qda = QuadraticDiscriminantAnalysis()
# Fit the QDA model
qda.fit(X_train_selected_scaled, y_train)
QuadraticDiscriminantAnalysis()
# Predicting on the training data
y_pred_train_qda = qda.predict(X_train_selected_scaled)
# Evaluating performance on the training data
metrics_score(y_train, y_pred_train_qda)
precision recall f1-score support
0 0.91 0.91 0.91 3340
1 0.65 0.65 0.65 832
accuracy 0.86 4172
macro avg 0.78 0.78 0.78 4172
weighted avg 0.86 0.86 0.86 4172
Observation:
Performance is moderate on the training data, with both precision and recall at 0.65 for class 1.
# Predicting on the test data
y_pred_test_qda = qda.predict(X_test_selected_scaled)
# Evaluating performance on the test data
metrics_score(y_test, y_pred_test_qda)
precision recall f1-score support
0 0.91 0.92 0.92 1431
1 0.67 0.63 0.65 357
accuracy 0.87 1788
macro avg 0.79 0.78 0.78 1788
weighted avg 0.86 0.87 0.86 1788
Observation:
The model is struggling to adequately identify the minority class, with low metrics for class 1. Let's see if we can achieve improved performance with other models.
# Define the SVM model
svm = SVC(probability=True) # 'probability=True' is needed for predict_proba
# Fit the SVM model
svm.fit(X_train_scaled, y_train)
SVC(probability=True)
# Checking performance on the training data
y_pred_train_svm = svm.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
precision recall f1-score support
0 0.93 0.95 0.94 3340
1 0.80 0.71 0.75 832
accuracy 0.91 4172
macro avg 0.86 0.83 0.84 4172
weighted avg 0.90 0.91 0.90 4172
Observation:
Performance on the training data is relatively strong, although the model lags in recall for class 1 (0.71).
# Checking performance on the test dataset
y_pred_test_svm = svm.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
precision recall f1-score support
0 0.91 0.95 0.93 1431
1 0.76 0.64 0.69 357
accuracy 0.89 1788
macro avg 0.84 0.79 0.81 1788
weighted avg 0.88 0.89 0.88 1788
Observation:
The recall for class 1 on the test data is relatively low (0.64). Let's see if we can improve performance by tuning the model.
# Pipeline with SMOTE and SVM
pipeline = Pipeline([
('smote', SMOTE(random_state=1)),
('svm', SVC(random_state=1))
])
# Define a parameter grid for SVM
param_grid = {
'svm__C': [1, 5, 10],
'svm__gamma': ['scale', 0.05, 0.1],
'svm__kernel': ['rbf', 'poly', 'sigmoid']
}
# Custom scorer for recall class 1
recall_scorer = make_scorer(recall_score, pos_label=1)
# RandomizedSearchCV for tuning with SMOTE
random_search_svm = RandomizedSearchCV(
pipeline,
param_grid,
n_iter=10,
scoring=recall_scorer,
cv=3,
verbose=2,
n_jobs=-1,
random_state=1
)
# Fit random_search to the data
random_search_svm.fit(X_train_scaled, y_train)
# Best estimator after random search
svm_tuned = random_search_svm.best_estimator_
# Fitting the best SVM model to the training data
svm_tuned.fit(X_train_scaled, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Pipeline(steps=[('smote', SMOTE(random_state=1)),
('svm', SVC(C=1, gamma=0.05, random_state=1))])
# Checking performance on the training data
y_pred_train_svm = svm_tuned.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
precision recall f1-score support
0 0.97 0.91 0.94 3340
1 0.72 0.89 0.79 832
accuracy 0.91 4172
macro avg 0.84 0.90 0.87 4172
weighted avg 0.92 0.91 0.91 4172
Observation:
The recall for class 1 on the training data increased markedly (from 0.71 to 0.89) compared to the default SVM, while overall accuracy held at 0.91.
# Checking performance on the test data
y_pred_test_svm = svm_tuned.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
precision recall f1-score support
0 0.94 0.91 0.93 1431
1 0.69 0.78 0.73 357
accuracy 0.89 1788
macro avg 0.82 0.85 0.83 1788
weighted avg 0.89 0.89 0.89 1788
Observations:
The gains carry over to the test data: recall for class 1 improves from 0.64 to 0.78 while accuracy holds at 0.89, making the tuned SVM one of the stronger models so far.
# Define the AdaBoost Classifier
adaboost = AdaBoostClassifier(random_state=1)
# Fit the AdaBoost Classifier
adaboost.fit(X_train, y_train)
AdaBoostClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_adaboost = adaboost.predict(X_train)
metrics_score(y_train, y_pred_train_adaboost)
precision recall f1-score support
0 0.93 0.97 0.95 3340
1 0.84 0.69 0.76 832
accuracy 0.91 4172
macro avg 0.88 0.83 0.85 4172
weighted avg 0.91 0.91 0.91 4172
Observation:
While accuracy is relatively strong on the train data, the model shows a low recall for class 1.
# Checking performance on the test dataset
y_pred_test_adaboost = adaboost.predict(X_test)
metrics_score(y_test, y_pred_test_adaboost)
precision recall f1-score support
0 0.91 0.97 0.94 1431
1 0.84 0.59 0.70 357
accuracy 0.90 1788
macro avg 0.87 0.78 0.82 1788
weighted avg 0.89 0.90 0.89 1788
Observation:
While accuracy remains relatively strong on the test data, the recall for class 1 is particularly low. Let's try to improve performance by hyperparameter tuning.
# Creating a pipeline with SMOTE and AdaBoost
pipeline = Pipeline([
('smote', SMOTE(random_state=1)),
('adaboost', AdaBoostClassifier(random_state=1))
])
# Define the parameter grid
param_grid = {
'adaboost__n_estimators': [50, 100, 150],
'adaboost__learning_rate': [0.01, 0.1, 1.0]
}
# Custom scorer for recall class 1
recall_scorer = make_scorer(recall_score, pos_label=1)
# RandomizedSearchCV for tuning
random_search_adaboost = RandomizedSearchCV(
pipeline,
param_grid,
n_iter=10,
scoring=recall_scorer,
cv=3,
verbose=2,
n_jobs=-1,
random_state=1
)
# Fit random_search to the data
random_search_adaboost.fit(X_train, y_train)
# Best estimator after random search
adaboost_tuned = random_search_adaboost.best_estimator_
# Fitting the best model to the training data
adaboost_tuned.fit(X_train, y_train)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
Pipeline(steps=[('smote', SMOTE(random_state=1)),
('adaboost',
AdaBoostClassifier(learning_rate=0.1, n_estimators=100,
random_state=1))])
# Checking performance on the training data
y_pred_train_adaboost_tuned = adaboost_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_adaboost_tuned)
precision recall f1-score support
0 0.94 0.90 0.92 3340
1 0.65 0.78 0.71 832
accuracy 0.87 4172
macro avg 0.80 0.84 0.82 4172
weighted avg 0.89 0.87 0.88 4172
Observation:
The class 1 recall on the train data has improved, although it remains moderate.
# Checking performance on the test data
y_pred_test_adaboost_tuned = adaboost_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_adaboost_tuned)
precision recall f1-score support
0 0.93 0.90 0.92 1431
1 0.65 0.75 0.69 357
accuracy 0.87 1788
macro avg 0.79 0.82 0.81 1788
weighted avg 0.88 0.87 0.87 1788
Observation:
The metrics for class 1 on the test data remain weak. Let's check if we can get improved performance with a gradient boosting model.
# Define the Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state=1)
# Fit the Gradient Boosting Classifier
gbc.fit(X_train, y_train)
GradientBoostingClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_gbc = gbc.predict(X_train)
metrics_score(y_train, y_pred_train_gbc)
precision recall f1-score support
0 0.94 0.98 0.96 3340
1 0.89 0.74 0.81 832
accuracy 0.93 4172
macro avg 0.91 0.86 0.88 4172
weighted avg 0.93 0.93 0.93 4172
Observation:
Despite strong accuracy on the training data, the recall for class 1 is only 0.74.
# Checking performance on the test data
y_pred_test_gbc = gbc.predict(X_test)
metrics_score(y_test, y_pred_test_gbc)
precision recall f1-score support
0 0.91 0.97 0.94 1431
1 0.85 0.62 0.72 357
accuracy 0.90 1788
macro avg 0.88 0.80 0.83 1788
weighted avg 0.90 0.90 0.90 1788
Observation:
Despite maintaining a strong accuracy on the test data, the recall for class 1 is low. Let's try to improve performance by hyperparameter tuning.
# Pipeline with SMOTE and Gradient Boosting
pipeline = Pipeline([
('smote', SMOTE(random_state=1)),
('gbc', GradientBoostingClassifier(random_state=1))
])
# Define a parameter grid
param_grid = {
'gbc__n_estimators': [150, 200],
'gbc__learning_rate': [0.01, 0.05],
'gbc__max_depth': [2, 3],
'gbc__min_samples_split': [4, 6],
'gbc__min_samples_leaf': [1, 2]
}
# Custom scorer for recall class 1
recall_scorer = make_scorer(recall_score, pos_label=1)
# RandomizedSearchCV for tuning
random_search_gbc = RandomizedSearchCV(
pipeline,
param_grid,
n_iter=10,
scoring=recall_scorer,
cv=3,
verbose=2,
n_jobs=-1,
random_state=1
)
# Fit random_search to the data
random_search_gbc.fit(X_train, y_train)
# Best estimator after random search
gbc_tuned = random_search_gbc.best_estimator_
# Fitting the best model to the training data
gbc_tuned.fit(X_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Pipeline(steps=[('smote', SMOTE(random_state=1)),
('gbc',
GradientBoostingClassifier(learning_rate=0.05, max_depth=2,
min_samples_split=4,
n_estimators=200,
random_state=1))])
# Checking performance on the training data
y_pred_train_gbc_tuned = gbc_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_gbc_tuned)
precision recall f1-score support
0 0.94 0.92 0.93 3340
1 0.72 0.78 0.75 832
accuracy 0.90 4172
macro avg 0.83 0.85 0.84 4172
weighted avg 0.90 0.90 0.90 4172
Observation:
The model performs well on the training data overall, although the recall for class 1, at 0.78, still leaves room for improvement.
# Checking performance on the test data
y_pred_test_gbc_tuned = gbc_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_gbc_tuned)
precision recall f1-score support
0 0.93 0.93 0.93 1431
1 0.71 0.72 0.72 357
accuracy 0.89 1788
macro avg 0.82 0.82 0.82 1788
weighted avg 0.89 0.89 0.89 1788
Observation:
Overall performance on the test data remains robust, although the recall for class 1 continues to lag. Let's check how an XGBClassifier performs on our data.
# Define the XGBoost Classifier
xgb = XGBClassifier(random_state=1, eval_metric='logloss')
# Fit the XGBoost Classifier
xgb.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=None,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)
# Checking performance on the training data
y_pred_train_xgb = xgb.predict(X_train)
metrics_score(y_train, y_pred_train_xgb)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation:
The model appears to be overfitting the training data, with perfect scores across the board.
# Checking performance on the test data
y_pred_test_xgb = xgb.predict(X_test)
metrics_score(y_test, y_pred_test_xgb)
precision recall f1-score support
0 0.93 0.98 0.95 1431
1 0.88 0.71 0.79 357
accuracy 0.92 1788
macro avg 0.91 0.85 0.87 1788
weighted avg 0.92 0.92 0.92 1788
Observation:
Overall performance is strong, but the recall for class 1 is moderate. Let's see if we can get better performance by hyperparameter tuning.
# Pipeline with SMOTE and XGBoost
pipeline = Pipeline([
('smote', SMOTE(random_state=1)),
('xgb', XGBClassifier(random_state=1, eval_metric='logloss'))
])
# Parameter grid
param_grid = {
'xgb__n_estimators': [150, 200],
'xgb__learning_rate': [0.05, 0.1],
'xgb__max_depth': [3, 4],
'xgb__gamma': [0.2, 0.3, 0.4],
'xgb__min_child_weight': [2, 3],
'xgb__scale_pos_weight': [2, 3, 4],
'xgb__lambda': [1, 1.5, 2] # L2 regularization
}
# Custom scorer for recall class 1
recall_scorer = make_scorer(recall_score, pos_label=1)
# RandomizedSearchCV for tuning
random_search_xgb = RandomizedSearchCV(
pipeline,
param_grid,
n_iter=10,
scoring=recall_scorer,
cv=3,
verbose=2,
n_jobs=-1,
random_state=1
)
# Fit random_search to the data
random_search_xgb.fit(X_train, y_train)
# Best estimator after random search
xgb_tuned = random_search_xgb.best_estimator_
# Fitting the best model
xgb_tuned.fit(X_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Pipeline(steps=[('smote', SMOTE(random_state=1)),
('xgb',
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=0.4, grow_policy=None,
importance_type=None,
interaction_constraints=None, lambda=1.5,
learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=3,
max_leaves=None, min_child_weight=2, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=200, n_jobs=None,
num_parallel_tree=None, ...))])
# Checking performance on the training data
y_pred_train_xgb_tuned = xgb_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_xgb_tuned)
precision recall f1-score support
0 0.98 0.90 0.94 3340
1 0.70 0.92 0.80 832
accuracy 0.91 4172
macro avg 0.84 0.91 0.87 4172
weighted avg 0.92 0.91 0.91 4172
Observation:
Performance on the train data is robust, including high recall on class 1.
# Checking performance on the test data
y_pred_test_xgb_tuned = xgb_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_xgb_tuned)
precision recall f1-score support
0 0.95 0.89 0.92 1431
1 0.66 0.82 0.73 357
accuracy 0.88 1788
macro avg 0.80 0.85 0.83 1788
weighted avg 0.89 0.88 0.88 1788
Observation:
The recall for class 1 on the test data, at 0.82, matches the decision tree's, placing this model among our strongest predictors.
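To put the tuned models side by side, the class 1 test recalls transcribed from the classification reports above can be collected into a small summary. The table below is a sketch (the labels are illustrative; the numbers are the ones reported earlier):

```python
import pandas as pd

# Class 1 recall on the test data, transcribed from the reports above
test_recall_class1 = {
    'Random Forest (tuned)': 0.79,
    'KNN (tuned + SMOTE)': 0.81,
    'LDA (threshold 0.2)': 0.70,
    'QDA': 0.63,
    'SVM (tuned + SMOTE)': 0.78,
    'AdaBoost (tuned)': 0.75,
    'Gradient Boosting (tuned)': 0.72,
    'XGBoost (tuned + SMOTE)': 0.82,
}
comparison = pd.Series(test_recall_class1, name='test_recall_class_1')
print(comparison.sort_values(ascending=False))
```

The tuned XGBoost and KNN pipelines lead on this metric, with the tuned random forest and SVM close behind.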
Clearing the backend and fixing seeds for random number generators
# Clearing the backend
K.clear_session()
# Fixing the seeds for the random number generators so that we receive the same output every time
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
Building a feed-forward Artificial Neural Network (ANN) with 2 hidden layers and an output layer
# We will be adding the layers sequentially
ann_model = Sequential()
# First hidden layer with 50 neurons and 'relu' activation. Input shape indicates the number of features in the dataset.
ann_model.add(Dense(50, activation='relu', input_shape=(27,)))
# Adding Dropout to prevent overfitting (20% of neurons)
ann_model.add(Dropout(0.2))
# Second hidden layer with 12 neurons and 'relu' activation
ann_model.add(Dense(12, activation='relu'))
# Adding Dropout to prevent overfitting (10% of neurons)
ann_model.add(Dropout(0.1))
# Output layer with 1 neuron (binary classification) and 'sigmoid' activation
ann_model.add(Dense(1, activation='sigmoid'))
# Compile the model
ann_model.compile(loss='binary_crossentropy',
optimizer='adamax',
metrics=['accuracy'])
# Model summary
ann_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 50) 1400
dropout (Dropout) (None, 50) 0
dense_1 (Dense) (None, 12) 612
dropout_1 (Dropout) (None, 12) 0
dense_2 (Dense) (None, 1) 13
=================================================================
Total params: 2025 (7.91 KB)
Trainable params: 2025 (7.91 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
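The parameter counts in the summary follow directly from the dense layer shapes (inputs × units weights plus one bias per unit); the dropout layers contribute nothing:

```python
# Dense layer parameters = inputs * units + units (one bias per unit)
layer_1 = 27 * 50 + 50  # 1400
layer_2 = 50 * 12 + 12  # 612
layer_3 = 12 * 1 + 1    # 13
total_params = layer_1 + layer_2 + layer_3
print(total_params)  # 2025, matching the model summary
```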
# Set class weights
class_weights = class_weight.compute_class_weight(
'balanced',
classes=np.unique(y_train),
y=y_train
)
class_weights_dict = dict(enumerate(class_weights))
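With the `'balanced'` mode, each class weight equals n_samples / (n_classes × class_count), so the minority class gets a proportionally larger weight. A minimal sketch using toy labels with the same 3,340/832 split as our training set (illustrative only; the cell above computes the weights on the real `y_train`):

```python
import numpy as np
from sklearn.utils import class_weight

# Toy labels reproducing the 3340 (class 0) / 832 (class 1) training split
y_demo = np.array([0] * 3340 + [1] * 832)
w = class_weight.compute_class_weight('balanced', classes=np.unique(y_demo), y=y_demo)
# 'balanced' weight = n_samples / (n_classes * class_count)
print(dict(enumerate(np.round(w, 3))))  # {0: 0.625, 1: 2.507}
```

Misclassifying a defaulter therefore costs the network roughly four times as much as misclassifying a good loan.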
# Fit the model
history_1 = ann_model.fit(
X_train_scaled, y_train,
validation_split=0.1,
epochs=150,
class_weight=class_weights_dict,
verbose=2
)
Epoch 1/150 118/118 - 1s - loss: 0.6832 - accuracy: 0.6420 - val_loss: 0.6424 - val_accuracy: 0.6124 - 1s/epoch - 12ms/step
Epoch 2/150 118/118 - 0s - loss: 0.6184 - accuracy: 0.6878 - val_loss: 0.6044 - val_accuracy: 0.7105 - 268ms/epoch - 2ms/step
Epoch 3/150 118/118 - 0s - loss: 0.5708 - accuracy: 0.7275 - val_loss: 0.5568 - val_accuracy: 0.7560 - 255ms/epoch - 2ms/step
...
Epoch 149/150 118/118 - 0s - loss: 0.3396 - accuracy: 0.8631 - val_loss: 0.3378 - val_accuracy: 0.8612 - 242ms/epoch - 2ms/step
Epoch 150/150 118/118 - 0s - loss: 0.3257 - accuracy: 0.8655 - val_loss: 0.3307 - val_accuracy: 0.8565 - 246ms/epoch - 2ms/step
# Plotting Accuracy vs Epochs
plt.plot(history_1.history['accuracy'])
plt.plot(history_1.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
Observation:
The training and validation accuracy curves converge smoothly at a relatively high level.
# Checking performance on the training data
train_loss, train_accuracy = ann_model.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the train data
y_pred_train_ann = ann_model.predict(X_train_scaled)
y_pred_train_ann = (y_pred_train_ann > 0.5).astype(int) # Convert probabilities to binary predictions
# Confusion matrix for the train data
metrics_score(y_train, y_pred_train_ann)
131/131 [==============================] - 0s 1ms/step - loss: 0.2929 - accuracy: 0.8840
Train Accuracy: 88.40%
56/56 [==============================] - 0s 1ms/step - loss: 0.3263 - accuracy: 0.8725
Test Accuracy: 87.25%
131/131 [==============================] - 0s 1ms/step
precision recall f1-score support
0 0.97 0.89 0.92 3340
1 0.66 0.87 0.75 832
accuracy 0.88 4172
macro avg 0.81 0.88 0.84 4172
weighted avg 0.90 0.88 0.89 4172
Observation:
Overall performance on the train data is relatively strong, including the recall on class 1.
# Checking performance on the training data
train_loss, train_accuracy = ann_model.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the test data
y_pred_test_ann = ann_model.predict(X_test_scaled)
y_pred_test_ann = (y_pred_test_ann > 0.5).astype(int)
# Confusion matrix for the test data
metrics_score(y_test, y_pred_test_ann)
131/131 [==============================] - 0s 1ms/step - loss: 0.2929 - accuracy: 0.8840
Train Accuracy: 88.40%
56/56 [==============================] - 0s 2ms/step - loss: 0.3263 - accuracy: 0.8725
Test Accuracy: 87.25%
56/56 [==============================] - 0s 1ms/step
precision recall f1-score support
0 0.95 0.89 0.92 1431
1 0.64 0.81 0.72 357
accuracy 0.87 1788
macro avg 0.80 0.85 0.82 1788
weighted avg 0.89 0.87 0.88 1788
Observation:
The recall for class 1 on the test data, at 0.81, is relatively robust while overall performance remains solid. Let's try to achieve even better results through hyperparameter tuning.
# Clearing the backend
K.clear_session()
# Fixing the seed for random number generators
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
# Define a custom loss function that takes into account the class imbalance
def custom_loss(y_true, y_pred):
y_true = tf.cast(y_true, tf.float32) # Cast y_true to float32
# Constants
beta = 8 # Weight for the minority class (class 1)
alpha = 2 # Weight for the majority class (class 0)
# Calculate binary cross-entropy loss
bce = K.binary_crossentropy(y_true, y_pred)
# Apply weights
weight_vector = y_true * beta + (1 - y_true) * alpha
weighted_bce = weight_vector * bce
return K.mean(weighted_bce)
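To see what this weighting does numerically, here is an equivalent NumPy re-implementation of the same weighted cross-entropy (an illustrative stand-in for the TensorFlow version above, with the same beta = 8 and alpha = 2): a class-1 example contributes 8× its BCE, a class-0 example 2×.

```python
import numpy as np

def weighted_bce_numpy(y_true, y_pred, beta=8.0, alpha=2.0):
    """NumPy equivalent of the custom loss: per-sample binary cross-entropy
    scaled by beta for class 1 and alpha for class 0, then averaged."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    weights = y_true * beta + (1 - y_true) * alpha
    return np.mean(weights * bce)

# A confident correct positive (p = 0.9) and a confident correct negative (p = 0.1):
# both have BCE = -ln(0.9) ~ 0.1054, weighted by 8 and 2 respectively
print(round(weighted_bce_numpy([1, 0], [0.9, 0.1]), 4))  # 0.5268
```

Because beta > alpha, a missed defaulter is penalized four times as heavily as a missed good loan, which pushes the model toward higher class-1 recall.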
# Hyperparameter tuning with Keras Tuner
# Function to create the model (accepts hyperparameters as arguments)
def create_model(learn_rate=0.01, neurons=256, dropout_rate=0.3):
model = Sequential()
model.add(Dense(neurons, activation='relu', input_dim=X_train_scaled.shape[1]))
model.add(Dropout(dropout_rate))
model.add(Dense(neurons // 2, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(neurons // 4, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))
# Compile the model with the custom weighted loss and accuracy metric
optimizer = tf.keras.optimizers.Adamax(learning_rate=learn_rate)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy'])
return model
# Wrap the model with KerasClassifier
keras_estimator = KerasClassifier(model=create_model, verbose=1)
# Define the grid search parameters
param_random = {
'model__learn_rate': [0.01, 0.05, 0.001],
'model__neurons': [40, 50, 60, 70, 80],
'model__dropout_rate': [0.2, 0.3, 0.4],
'batch_size': [32, 64, 128]
}
# RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=keras_estimator, param_distributions=param_random,
n_iter=25, cv=5, verbose=2, n_jobs=-1, random_state=1)
# Fitting the model
random_search_result = random_search.fit(X_train_scaled, y_train, validation_split=0.2)
# Best parameters
print("Best: %f using %s" % (random_search_result.best_score_, random_search_result.best_params_))
Fitting 5 folds for each of 25 candidates, totalling 125 fits
53/53 [==============================] - 1s 7ms/step - loss: 1.8608 - accuracy: 0.7633 - val_loss: 1.5219 - val_accuracy: 0.8551
Best: 0.851386 using {'model__neurons': 80, 'model__learn_rate': 0.05, 'model__dropout_rate': 0.3, 'batch_size': 64}
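The grid above spans 3 × 5 × 3 × 3 = 135 parameter combinations, of which RandomizedSearchCV evaluates only n_iter = 25. The sampling step can be sketched with scikit-learn's `ParameterSampler` (illustrative only; `RandomizedSearchCV` does this internally):

```python
from sklearn.model_selection import ParameterSampler

# Same search space as the RandomizedSearchCV above
param_random = {
    'model__learn_rate': [0.01, 0.05, 0.001],
    'model__neurons': [40, 50, 60, 70, 80],
    'model__dropout_rate': [0.2, 0.3, 0.4],
    'batch_size': [32, 64, 128],
}
candidates = list(ParameterSampler(param_random, n_iter=25, random_state=1))
print(len(candidates))  # 25 of the 3 * 5 * 3 * 3 = 135 possible combinations
```

Each sampled candidate is then cross-validated (cv=5), for 125 fits in total, which is far cheaper than an exhaustive grid search over all 135 combinations.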
# Check the optimized hyperparameters
random_search_result.best_params_
{'model__neurons': 80,
'model__learn_rate': 0.05,
'model__dropout_rate': 0.3,
'batch_size': 64}
Build ANN model with optimized hyperparameters
# Use the best hyperparameters found by RandomizedSearchCV
best_learn_rate = random_search_result.best_params_['model__learn_rate']
best_neurons = random_search_result.best_params_['model__neurons']
best_dropout_rate = random_search_result.best_params_['model__dropout_rate']
best_batch_size = random_search_result.best_params_['batch_size']
# Create a new model with these best hyperparameters (create_model compiles it internally)
ann_model_tuned = create_model(learn_rate=best_learn_rate, neurons=best_neurons, dropout_rate=best_dropout_rate)
# Fit the model to the training data
history_2 = ann_model_tuned.fit(X_train_scaled, y_train, epochs=50, batch_size=best_batch_size, verbose=1, validation_split=0.2, class_weight=class_weights_dict)
Epoch 1/50 53/53 [==============================] - 1s 8ms/step - loss: 2.1458 - accuracy: 0.7740 - val_loss: 1.4902 - val_accuracy: 0.8551
Epoch 2/50 53/53 [==============================] - 0s 4ms/step - loss: 1.4751 - accuracy: 0.8367 - val_loss: 1.5065 - val_accuracy: 0.8587
Epoch 3/50 53/53 [==============================] - 0s 4ms/step - loss: 1.3922 - accuracy: 0.8472 - val_loss: 1.3845 - val_accuracy: 0.8479
...
Epoch 49/50 53/53 [==============================] - 0s 4ms/step - loss: 0.6196 - accuracy: 0.9083 - val_loss: 1.4450 - val_accuracy: 0.8647
Epoch 50/50 53/53 [==============================] - 0s 4ms/step - loss: 0.5877 - accuracy: 0.9092 - val_loss: 1.5746 - val_accuracy: 0.8539
# Plotting Accuracy vs Epoch
plt.plot(history_2.history['accuracy'])
plt.plot(history_2.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
Observation:
Overall accuracy increases with the number of epochs. In the later epochs, however, the training and validation metrics diverge: training loss keeps falling while validation loss climbs (from about 1.05 at epoch 29 to 1.57 at epoch 50), suggesting the model is starting to overfit the training data.
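Accuracy curves can understate this effect; the train/validation loss gap is a more direct overfitting signal. Below is a minimal, illustrative sketch using a mocked history-style dictionary (not the actual `history_2` values) showing how such a gap could be flagged programmatically:

```python
# Sketch: detect a widening train/validation loss gap from a Keras-style
# history dictionary. The numbers below are illustrative, not the real run.
def loss_gap(history, last_n=5):
    """Mean (val_loss - loss) over the final `last_n` epochs."""
    tr = history["loss"][-last_n:]
    va = history["val_loss"][-last_n:]
    return sum(v - t for t, v in zip(tr, va)) / len(tr)

mock_history = {
    "loss":     [0.90, 0.80, 0.70, 0.65, 0.62, 0.60],
    "val_loss": [0.95, 0.92, 1.00, 1.20, 1.45, 1.57],
}

gap = loss_gap(mock_history, last_n=3)
if gap > 0.2:  # heuristic threshold; tune for your loss scale
    print(f"Possible overfitting: mean val-train loss gap = {gap:.2f}")
```

With a real run, `mock_history` would simply be replaced by `history_2.history`.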
# Checking performance on the training data
train_loss, train_accuracy = ann_model_tuned.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model_tuned.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the train data
y_pred_train_ann_tuned = ann_model_tuned.predict(X_train_scaled)
y_pred_train_ann_tuned = (y_pred_train_ann_tuned > 0.5).astype(int)
# Confusion matrix for the train data
metrics_score(y_train, y_pred_train_ann_tuned)
131/131 [==============================] - 0s 2ms/step - loss: 0.6278 - accuracy: 0.9070
Train Accuracy: 90.70%
56/56 [==============================] - 0s 2ms/step - loss: 1.8005 - accuracy: 0.8775
Test Accuracy: 87.75%
131/131 [==============================] - 0s 1ms/step
              precision    recall  f1-score   support

           0       0.99      0.89      0.94      3340
           1       0.69      0.96      0.80       832

    accuracy                           0.91      4172
   macro avg       0.84      0.93      0.87      4172
weighted avg       0.93      0.91      0.91      4172
Observation:
The recall for class 1 on the train data is high at 0.96, meaning the model identifies nearly all defaulters in the training set; the trade-off is a lower precision of 0.69 for that class.
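These report figures follow directly from confusion-matrix counts. As a sanity check, precision and recall can be recomputed by hand; the counts below are illustrative values chosen to roughly match the class-1 row above, not the model's exact confusion matrix:

```python
# Sketch: precision and recall from raw confusion-matrix counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted defaulters, fraction correct
    recall = tp / (tp + fn)     # of actual defaulters, fraction caught
    return precision, recall

# Illustrative counts: ~800 true positives out of 832 actual defaulters,
# with ~360 false positives, reproduce the reported 0.69 / 0.96 figures.
p, r = precision_recall(tp=800, fp=360, fn=32)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.69, recall=0.96
```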
# Checking performance on the training data
train_loss, train_accuracy = ann_model_tuned.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model_tuned.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the test data
y_pred_test_ann_tuned = ann_model_tuned.predict(X_test_scaled)
y_pred_test_ann_tuned = (y_pred_test_ann_tuned > 0.5).astype(int)
# Confusion matrix for the test data
metrics_score(y_test, y_pred_test_ann_tuned)
131/131 [==============================] - 0s 2ms/step - loss: 0.6278 - accuracy: 0.9070
Train Accuracy: 90.70%
56/56 [==============================] - 0s 2ms/step - loss: 1.8005 - accuracy: 0.8775
Test Accuracy: 87.75%
56/56 [==============================] - 0s 1ms/step
              precision    recall  f1-score   support

           0       0.96      0.89      0.92      1431
           1       0.65      0.84      0.73       357

    accuracy                           0.88      1788
   macro avg       0.80      0.86      0.83      1788
weighted avg       0.90      0.88      0.88      1788
Observations:
On the test data, recall for class 1 is 0.84 with a precision of 0.65, only slightly below the training figures, so the model still catches most defaulters on unseen data.
We can also try different optimizers to see which one performs better with our data:
# Clearing the backend
K.clear_session()
# Fixing the seed for random number generators
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
# Model architecture
ann_model_nadam = Sequential()
ann_model_nadam.add(Dense(50, activation='relu', input_shape=(27,)))
ann_model_nadam.add(Dropout(0.2))
ann_model_nadam.add(Dense(12, activation='relu'))
ann_model_nadam.add(Dropout(0.1))
ann_model_nadam.add(Dense(1, activation='sigmoid'))
# Compile the model
ann_model_nadam.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
# Model summary
ann_model_nadam.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
dense (Dense)               (None, 50)                1400
dropout (Dropout)           (None, 50)                0
dense_1 (Dense)             (None, 12)                612
dropout_1 (Dropout)         (None, 12)                0
dense_2 (Dense)             (None, 1)                 13
=================================================================
Total params: 2025 (7.91 KB)
Trainable params: 2025 (7.91 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
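These counts can be checked by hand: a Dense layer contributes (inputs × units) weights plus one bias per unit, and Dropout layers add no parameters. A quick sketch using the architecture from the summary above:

```python
# Verify the summary's parameter counts: Dense params = inputs*units + units.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

# (inputs, units) for each Dense layer in the summary above
layers = [(27, 50), (50, 12), (12, 1)]
counts = [dense_params(i, o) for i, o in layers]
print(counts, sum(counts))  # [1400, 612, 13] 2025
```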
# Set class weights
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(enumerate(class_weights))
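For reference, the 'balanced' heuristic weights each class by n_samples / (n_classes × count_c), so the minority (defaulter) class is up-weighted in the loss. A pure-Python sketch of that formula, using the class counts from the training-set support shown earlier rather than calling scikit-learn:

```python
# Sketch of scikit-learn's 'balanced' class-weight formula:
#   weight_c = n_samples / (n_classes * count_c)
# Counts match the training-set support (3340 non-defaulters, 832 defaulters).
counts = {0: 3340, 1: 832}
n_samples = sum(counts.values())
n_classes = len(counts)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(weights)  # the minority class (1) gets the larger weight
```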
# Fit the model
history_3 = ann_model_nadam.fit(
    X_train_scaled, y_train,
    validation_split=0.1,
    epochs=150,
    class_weight=class_weights_dict,
    verbose=2
)
Epoch 1/150 118/118 - 2s - loss: 0.6379 - accuracy: 0.6838 - val_loss: 0.5719 - val_accuracy: 0.7464 - 2s/epoch - 15ms/step
... (epochs 2-99 output truncated) ...
Epoch 100/150 118/118 - 0s - loss: 0.2674 - accuracy: 0.8783 - val_loss: 0.2880 - val_accuracy: 0.8684 - 252ms/epoch - 2ms/step
... (epochs 101-149 output truncated) ...
Epoch 150/150 118/118 - 0s - loss: 0.2347 - accuracy: 0.8929 - val_loss: 0.2770 - val_accuracy: 0.8876 - 262ms/epoch - 2ms/step
# Plotting Accuracy vs Epochs
plt.plot(history_3.history['accuracy'])
plt.plot(history_3.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
Observation:
As training progresses, both the training and validation accuracies rise steadily, indicating that the model keeps learning. After roughly 100 epochs the curves flatten, and further epochs yield only marginal gains.
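Rather than fixing the epoch count at 150, training could stop automatically once validation loss plateaus, e.g. with `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)`. The patience logic behind that callback can be sketched in plain Python on an illustrative loss curve:

```python
# Sketch of the patience logic behind Keras's EarlyStopping callback:
# stop once val_loss has not strictly improved for `patience` epochs in a row.
def early_stop_epoch(val_losses, patience=5):
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # epoch index at which training would stop
    return len(val_losses) - 1  # patience never exhausted

# Illustrative curve: improves for a few epochs, then plateaus
curve = [0.50, 0.42, 0.38, 0.39, 0.40, 0.38, 0.39, 0.41, 0.40, 0.42]
print(early_stop_epoch(curve, patience=3))
```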
# Checking performance on the training data
train_loss, train_accuracy = ann_model_nadam.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model_nadam.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the train data
y_pred_train_ann_nadam = ann_model_nadam.predict(X_train_scaled)
y_pred_train_ann_nadam = (y_pred_train_ann_nadam > 0.5).astype(int)
# Confusion matrix for the train data
metrics_score(y_train, y_pred_train_ann_nadam)
131/131 [==============================] - 0s 1ms/step - loss: 0.1948 - accuracy: 0.9192
Train Accuracy: 91.92%
56/56 [==============================] - 0s 1ms/step - loss: 0.3032 - accuracy: 0.8876
Test Accuracy: 88.76%
131/131 [==============================] - 0s 1ms/step
              precision    recall  f1-score   support

           0       0.99      0.91      0.95      3340
           1       0.72      0.96      0.83       832

    accuracy                           0.92      4172
   macro avg       0.86      0.93      0.89      4172
weighted avg       0.94      0.92      0.92      4172
Observation:
The recall for class 1 on the train data is high at 0.96, indicating that the model is effective at identifying most of the positive cases in the training data.
# Checking performance on the training data
train_loss, train_accuracy = ann_model_nadam.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model_nadam.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the test data
y_pred_test_ann_nadam = ann_model_nadam.predict(X_test_scaled)
y_pred_test_ann_nadam = (y_pred_test_ann_nadam > 0.5).astype(int)
# Confusion matrix for the test data
metrics_score(y_test, y_pred_test_ann_nadam)
131/131 [==============================] - 0s 1ms/step - loss: 0.1948 - accuracy: 0.9192
Train Accuracy: 91.92%
56/56 [==============================] - 0s 1ms/step - loss: 0.3032 - accuracy: 0.8876
Test Accuracy: 88.76%
56/56 [==============================] - 0s 1ms/step
              precision    recall  f1-score   support

           0       0.95      0.90      0.93      1431
           1       0.68      0.83      0.75       357

    accuracy                           0.89      1788
   macro avg       0.82      0.87      0.84      1788
weighted avg       0.90      0.89      0.89      1788
Observation:
The test data exhibits a robust recall of 0.83 for class 1, with only a minor decrease in precision.
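The predictions above use a fixed 0.5 cutoff on the sigmoid output (`> 0.5`). Since the bank cares most about catching defaulters, the threshold itself is a tunable lever: lowering it raises class-1 recall at the cost of precision. An illustrative sketch on made-up probabilities (not the model's actual outputs):

```python
# Sketch: how the decision threshold trades recall against precision.
# Probabilities and labels below are illustrative, not the model's output.
probs  = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]

def recall_at(threshold):
    """Recall for class 1 when predicting 1 iff prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return tp / (tp + fn)

for t in (0.3, 0.5, 0.7):
    print(t, round(recall_at(t), 2))  # recall falls as the threshold rises
```

In practice one would sweep thresholds on a validation split and pick the point whose recall/precision balance matches the bank's cost of a missed defaulter versus a wrongly rejected applicant.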
# Clearing the backend
K.clear_session()
# Fixing the seed for random number generators
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
# Function to create the model (accepts hyperparameters as arguments)
def create_model2(learn_rate=0.01, neurons=256, dropout_rate=0.3):
    model = Sequential()
    model.add(Dense(neurons, activation='relu', input_dim=X_train_scaled.shape[1]))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons // 2, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons // 4, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model
    optimizer = tf.keras.optimizers.Nadam(learning_rate=learn_rate)
    model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy'])
    return model
# Wrap the model with KerasClassifier
keras_estimator = KerasClassifier(model=create_model2, verbose=1)
# Define the grid search parameters
param_random = {
    'model__learn_rate': [0.01, 0.05, 0.001],
    'model__neurons': [40, 50, 60, 70, 80],
    'model__dropout_rate': [0.2, 0.3, 0.4],
    'batch_size': [32, 64, 128]
}
# RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=keras_estimator, param_distributions=param_random,
    n_iter=25, cv=5, verbose=2, n_jobs=-1, random_state=1
)
# Fitting the model
random_search_result = random_search.fit(X_train_scaled, y_train, validation_split=0.2)
# Best parameters
print("Best: %f using %s" % (random_search_result.best_score_, random_search_result.best_params_))
Fitting 5 folds for each of 25 candidates, totalling 125 fits
53/53 [==============================] - 2s 7ms/step - loss: 1.8576 - accuracy: 0.7648 - val_loss: 1.8186 - val_accuracy: 0.8180
Best: 0.845392 using {'model__neurons': 80, 'model__learn_rate': 0.05, 'model__dropout_rate': 0.3, 'batch_size': 64}
# Check the optimized hyperparameters
random_search_result.best_params_
{'model__neurons': 80,
'model__learn_rate': 0.05,
'model__dropout_rate': 0.3,
'batch_size': 64}
Build ANN-Nadam model with optimized hyperparameters
# Use the best hyperparameters found by RandomizedSearchCV
best_learn_rate = random_search_result.best_params_['model__learn_rate']
best_neurons = random_search_result.best_params_['model__neurons']
best_dropout_rate = random_search_result.best_params_['model__dropout_rate']
best_batch_size = random_search_result.best_params_['batch_size']
# Create a new model with these best hyperparameters
# (create_model2 already compiles it with the Nadam optimizer)
ann_model_nadam_tuned = create_model2(learn_rate=best_learn_rate, neurons=best_neurons, dropout_rate=best_dropout_rate)
# Fit the best model to the training data
history_4 = ann_model_nadam_tuned.fit(X_train_scaled, y_train, epochs=50, batch_size=best_batch_size, verbose=1, validation_split=0.2, class_weight=class_weights_dict)
Epoch 1/50
53/53 [==============================] - 1s 8ms/step - loss: 2.1458 - accuracy: 0.7740 - val_loss: 1.4902 - val_accuracy: 0.8551
... (epochs 2-49 output truncated) ...
Epoch 50/50
53/53 [==============================] - 0s 4ms/step - loss: 0.5877 - accuracy: 0.9092 - val_loss: 1.5746 - val_accuracy: 0.8539
# Plotting Accuracy vs Epochs
plt.plot(history_4.history['accuracy'])
plt.plot(history_4.history['val_accuracy'])
plt.title('Accuracy vs Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()
Observation:
Overall accuracy increases with the number of epochs. In later epochs, however, training and validation accuracy diverge while validation loss drifts upward, suggesting that the model is starting to overfit the training data.
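One standard safeguard against this, not used in the run above, is early stopping on the validation loss. The patience rule that Keras's EarlyStopping callback applies can be sketched in plain Python (the loss values below are illustrative, not the actual training history):

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop: the first
    epoch after val_loss has failed to improve on its best value for
    `patience` consecutive epochs (or the last epoch if that never happens)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Illustrative val_loss curve that improves, then plateaus
losses = [1.49, 1.38, 1.30, 1.25, 1.27, 1.26, 1.31, 1.28, 1.33]
print(early_stop_epoch(losses, patience=5))  # stops at epoch 9
```

In Keras itself this corresponds to `EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)` passed via the `callbacks` argument of `fit`.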
# Checking performance on the training data
train_loss, train_accuracy = ann_model_nadam_tuned.evaluate(X_train_scaled, y_train)
print("Train Accuracy: {:.2f}%".format(train_accuracy * 100))
# Evaluating performance on the test data
test_loss, test_accuracy = ann_model_nadam_tuned.evaluate(X_test_scaled, y_test)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
# Predicting on the train data
y_pred_train_ann_nadam_tuned = ann_model_nadam_tuned.predict(X_train_scaled)
y_pred_train_ann_nadam_tuned = (y_pred_train_ann_nadam_tuned > 0.5).astype(int)
# Confusion matrix for the train data
metrics_score(y_train, y_pred_train_ann_nadam_tuned)
131/131 [==============================] - 0s 1ms/step - loss: 0.6278 - accuracy: 0.9070
Train Accuracy: 90.70%
56/56 [==============================] - 0s 1ms/step - loss: 1.8005 - accuracy: 0.8775
Test Accuracy: 87.75%
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.99 | 0.89 | 0.94 | 3340 |
| 1 | 0.69 | 0.96 | 0.80 | 832 |
| accuracy | | | 0.91 | 4172 |
| macro avg | 0.84 | 0.93 | 0.87 | 4172 |
| weighted avg | 0.93 | 0.91 | 0.91 | 4172 |
Observation:
The recall for class 1 on the train data is high at 0.96, indicating that the model is effective at identifying most of the positive cases in the training data.
# Predicting on the test data
y_pred_test_ann_nadam_tuned = ann_model_nadam_tuned.predict(X_test_scaled)
y_pred_test_ann_nadam_tuned = (y_pred_test_ann_nadam_tuned > 0.5).astype(int)
# Confusion matrix for the test data
metrics_score(y_test, y_pred_test_ann_nadam_tuned)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.96 | 0.89 | 0.92 | 1431 |
| 1 | 0.65 | 0.84 | 0.73 | 357 |
| accuracy | | | 0.88 | 1788 |
| macro avg | 0.80 | 0.86 | 0.83 | 1788 |
| weighted avg | 0.90 | 0.88 | 0.88 | 1788 |
Observation:
Recall for class 1 on the test data is strong at 0.84. Although there is a slight decrease in precision for class 1, the average metrics remain robust.
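Because recall on defaulters (class 1) is the chosen measure of success, the 0.5 probability cutoff is itself a tunable parameter, as was done for the optimized logistic regression and LDA models. A minimal sketch of the trade-off with scikit-learn, on synthetic stand-in data rather than the HMEQ splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features, with ~20% positives like BAD
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lowering the cutoff trades precision for recall on class 1
for threshold in (0.5, 0.35, 0.2):
    preds = (proba > threshold).astype(int)
    print(threshold, round(recall_score(y_te, preds, pos_label=1), 3))
```

Recall at a lower threshold can never be worse than at a higher one, since the predicted-positive set only grows; the cost is paid in precision.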
# Clear the backend
K.clear_session()
# Create the SHAP DeepExplainer
explainer = sh.DeepExplainer(ann_model_nadam_tuned, X_train_scaled.values[:100]) # Using .values and a subset for efficiency
# Compute SHAP values for the test data
shap_values = explainer.shap_values(X_test_scaled.values)
# Plot the SHAP summary plot
sh.summary_plot(shap_values[0], X_test_scaled, feature_names=X_train_scaled.columns)
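DeepExplainer estimates Shapley values for the network's output; the underlying quantity is model-agnostic. For intuition, exact Shapley values can be computed by brute-force coalition enumeration on a toy scoring function (a hypothetical 3-feature linear model, not the ANN):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.
    Features absent from a coalition are set to their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Shapley kernel weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy additive scorer: the Shapley values recover each feature's contribution
predict = lambda v: 2 * v[0] + 3 * v[1] - v[2]
phi = shapley_values(predict, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # approximately [2.0, 3.0, -1.0]
```

The enumeration is exponential in the number of features, which is why SHAP's explainers approximate these values for the 12-plus-feature HMEQ models instead of computing them exactly.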
Observations:
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design:
# Function for generating a comparison frame for the recall on class 1 for a list of models
def compile_recalls_class1(models, model_datasets, model_class, model_names, custom_thresholds=None):
    """
    Compares the recall of class 1 for different models on their respective train and test datasets.

    Args:
        models (list): A list of models to compare.
        model_datasets (list): A list of tuples, each containing the X_train, y_train, X_test, y_test for each model.
        model_class (list): A list of human-readable model class names.
        model_names (list): A list of model names.
        custom_thresholds (dict, optional): A dictionary of model names and their custom thresholds.

    Returns:
        DataFrame: A DataFrame with the recall of class 1 for each model on train and test data.
    """
    custom_thresholds = custom_thresholds or {}  # Guard against None so the membership checks below don't fail
    recall_scores = []
    for i, model in enumerate(models):
        X_train, y_train, X_test, y_test = model_datasets[i]
        model_class_name = model_class[i]
        model_code_name = model_names[i]
        # Use the model's custom threshold if one is set, otherwise the default 0.5
        threshold = custom_thresholds.get(model_code_name, 0.5)
        # Train recall
        if 'ann_model' in model_code_name:  # ANN models output probabilities directly
            train_pred = (model.predict(X_train) > threshold).astype(int)
        else:
            train_pred = model.predict(X_train) if model_code_name not in custom_thresholds else (model.predict_proba(X_train)[:, 1] > threshold).astype(int)
        train_recall = recall_score(y_train, train_pred, pos_label=1)
        # Test recall
        if 'ann_model' in model_code_name:
            test_pred = (model.predict(X_test) > threshold).astype(int)
        else:
            test_pred = model.predict(X_test) if model_code_name not in custom_thresholds else (model.predict_proba(X_test)[:, 1] > threshold).astype(int)
        test_recall = recall_score(y_test, test_pred, pos_label=1)
        recall_scores.append([model_class_name, model_code_name, train_recall, test_recall])
    return pd.DataFrame(recall_scores, columns=['Model Class', 'Model Name', 'Recall_Train_Class_1', 'Recall_Test_Class_1'])
# Function for generating a comparison frame for macro averages for a list of models
def compile_macro_averages(models, model_datasets, model_class, model_names, custom_thresholds=None):
    """
    Compares macro-averaged precision and recall, plus accuracy, for different models on their respective train and test datasets.

    Args:
        models (list): A list of models to compare.
        model_datasets (list): A list of tuples, each containing the X_train, y_train, X_test, y_test for each model.
        model_class (list): A list of human-readable model class names.
        model_names (list): A list of model names.
        custom_thresholds (dict, optional): A dictionary of model names and their custom thresholds.

    Returns:
        DataFrame: A DataFrame with metrics for each model on train and test data.
    """
    custom_thresholds = custom_thresholds or {}  # Guard against None so the membership checks below don't fail
    metrics_scores = []
    for i, model in enumerate(models):
        X_train, y_train, X_test, y_test = model_datasets[i]
        model_class_name = model_class[i]
        model_code_name = model_names[i]
        # Use the model's custom threshold if one is set, otherwise the default 0.5
        threshold = custom_thresholds.get(model_code_name, 0.5)
        # Train metrics
        if 'ann_model' in model_code_name:  # ANN models output probabilities directly
            train_pred = (model.predict(X_train) > threshold).astype(int)
        else:
            train_pred = model.predict(X_train) if model_code_name not in custom_thresholds else (model.predict_proba(X_train)[:, 1] > threshold).astype(int)
        train_precision = precision_score(y_train, train_pred, average='macro')
        train_recall = recall_score(y_train, train_pred, average='macro')
        train_accuracy = accuracy_score(y_train, train_pred)
        # Test metrics
        if 'ann_model' in model_code_name:
            test_pred = (model.predict(X_test) > threshold).astype(int)
        else:
            test_pred = model.predict(X_test) if model_code_name not in custom_thresholds else (model.predict_proba(X_test)[:, 1] > threshold).astype(int)
        test_precision = precision_score(y_test, test_pred, average='macro')
        test_recall = recall_score(y_test, test_pred, average='macro')
        test_accuracy = accuracy_score(y_test, test_pred)
        metrics_scores.append([model_class_name, model_code_name, train_precision, test_precision, train_recall, test_recall, train_accuracy, test_accuracy])
    return pd.DataFrame(metrics_scores, columns=['Model Class', 'Model Name', 'Precision_Train', 'Precision_Test', 'Recall_Train', 'Recall_Test', 'Accuracy_Train', 'Accuracy_Test'])
# Function for generating a comparison frame for binary metrics on class 1 for a list of models
def compile_binary_metrics_class1(models, model_datasets, model_class, model_names, custom_thresholds=None):
    """
    Compares binary metrics (precision, recall, F1-score) for class 1 for different models on their respective train and test datasets.

    Args:
        models (list): A list of models to compare.
        model_datasets (list): A list of tuples, each containing the X_train, y_train, X_test, y_test for each model.
        model_class (list): A list of human-readable model class names.
        model_names (list): A list of model names.
        custom_thresholds (dict, optional): A dictionary of model names and their custom thresholds.

    Returns:
        DataFrame: A DataFrame with metrics for each model on train and test data.
    """
    custom_thresholds = custom_thresholds or {}  # Guard against None so the membership checks below don't fail
    metrics_scores = []
    for i, model in enumerate(models):
        X_train, y_train, X_test, y_test = model_datasets[i]
        model_class_name = model_class[i]
        model_code_name = model_names[i]
        # Use the model's custom threshold if one is set, otherwise the default 0.5
        threshold = custom_thresholds.get(model_code_name, 0.5)
        # Train metrics
        if 'ann_model' in model_code_name:  # ANN models output probabilities directly
            train_pred = (model.predict(X_train) > threshold).astype(int)
        else:
            train_pred = model.predict(X_train) if model_code_name not in custom_thresholds else (model.predict_proba(X_train)[:, 1] > threshold).astype(int)
        train_precision = precision_score(y_train, train_pred, pos_label=1)
        train_recall = recall_score(y_train, train_pred, pos_label=1)
        train_f1 = f1_score(y_train, train_pred, pos_label=1)
        # Test metrics
        if 'ann_model' in model_code_name:
            test_pred = (model.predict(X_test) > threshold).astype(int)
        else:
            test_pred = model.predict(X_test) if model_code_name not in custom_thresholds else (model.predict_proba(X_test)[:, 1] > threshold).astype(int)
        test_precision = precision_score(y_test, test_pred, pos_label=1)
        test_recall = recall_score(y_test, test_pred, pos_label=1)
        test_f1 = f1_score(y_test, test_pred, pos_label=1)
        metrics_scores.append([model_class_name, model_code_name, train_precision, test_precision, train_recall, test_recall, train_f1, test_f1])
    return pd.DataFrame(metrics_scores, columns=['Model Class', 'Model Name', 'Train_Precision_Class_1', 'Test_Precision_Class_1', 'Train_Recall_Class_1', 'Test_Recall_Class_1', 'Train_F1_Class_1', 'Test_F1_Class_1'])
# Building lists required in the functions
# Define models
models = [
lg,
lg_selected,
lg_selected_poly,
lg_selected_poly_optimized,
dt,
dt_tuned,
dt_tuned_selected,
rf_estimator,
rf_estimator_weighted,
rf_estimator_tuned,
knn,
knn_tuned,
lda,
lda_optimized,
qda,
svm,
svm_tuned,
adaboost,
adaboost_tuned,
gbc,
gbc_tuned,
xgb,
xgb_tuned,
ann_model,
ann_model_tuned,
ann_model_nadam,
ann_model_nadam_tuned
]
# Corresponding datasets for each model
model_datasets = [
(X_train_scaled, y_train, X_test_scaled, y_test), # for lg
(X_train_selected_scaled, y_train, X_test_selected_scaled, y_test), # for lg_selected
(X_train_selected_poly, y_train, X_test_selected_poly, y_test), # for lg_selected_poly
(X_train_selected_poly, y_train, X_test_selected_poly, y_test), # for lg_selected_poly_optimized
(X_train, y_train, X_test, y_test), # for dt
(X_train, y_train, X_test, y_test), # for dt_tuned
(X_train_selected, y_train, X_test_selected, y_test), # for dt_tuned_selected
(X_train, y_train, X_test, y_test), # for rf_estimator
(X_train, y_train, X_test, y_test), # for rf_estimator_weighted
(X_train, y_train, X_test, y_test), # for rf_estimator_tuned
(X_train_scaled, y_train, X_test_scaled, y_test), # for knn
(X_train_scaled, y_train, X_test_scaled, y_test), # for knn_tuned
(X_train_scaled, y_train, X_test_scaled, y_test), # for lda
(X_train_scaled, y_train, X_test_scaled, y_test), # for lda_optimized
(X_train_selected_scaled, y_train, X_test_selected_scaled, y_test), # for qda
(X_train_scaled, y_train, X_test_scaled, y_test), # for svm
(X_train_scaled, y_train, X_test_scaled, y_test), # for svm_tuned
(X_train, y_train, X_test, y_test), # for adaboost
(X_train, y_train, X_test, y_test), # for adaboost_tuned
(X_train, y_train, X_test, y_test), # for gbc
(X_train, y_train, X_test, y_test), # for gbc_tuned
(X_train, y_train, X_test, y_test), # for xgb
(X_train, y_train, X_test, y_test), # for xgb_tuned
(X_train_scaled, y_train, X_test_scaled, y_test), # for ann_model
(X_train_scaled, y_train, X_test_scaled, y_test), # for ann_model_tuned
(X_train_scaled, y_train, X_test_scaled, y_test), # for ann_model_nadam
(X_train_scaled, y_train, X_test_scaled, y_test), # for ann_model_nadam_tuned
]
# Model class for reference in the DataFrame
model_class = [
'Logistic Regression',
'Logistic Regression Feature Selected',
'Logistic Regression Feature Selected Poly',
'Logistic Regression Feature Selected Poly Optimized',
'Decision Tree',
'Decision Tree Tuned',
'Decision Tree Tuned Feature Selected',
'Random Forest',
'Random Forest Weighted',
'Random Forest Tuned',
'K-Nearest Neighbor (KNN)',
'KNN tuned',
'Linear Discriminant Analysis (LDA)',
'LDA Optimized',
'Quadratic Discriminant Analysis (QDA)',
'Support Vector Machine (SVM)',
'SVM Tuned',
'AdaBoost Classifier (ABC)',
'ABC Tuned',
'Gradient Boosting Classifier (GBC)',
'GBC Tuned',
'XGBoost Classifier (XGB)',
'XGB Tuned',
'Artificial Neural Network (ANN)',
'ANN Tuned',
'ANN with Nadam Optimizer',
'ANN with Nadam Optimizer Tuned'
]
# Model names
model_names = [
'lg',
'lg_selected',
'lg_selected_poly',
'lg_selected_poly_optimized',
'dt',
'dt_tuned',
'dt_tuned_selected',
'rf_estimator',
'rf_estimator_weighted',
'rf_estimator_tuned',
'knn',
'knn_tuned',
'lda',
'lda_optimized',
'qda',
'svm',
'svm_tuned',
'adaboost',
'adaboost_tuned',
'gbc',
'gbc_tuned',
'xgb',
'xgb_tuned',
'ann_model',
'ann_model_tuned',
'ann_model_nadam',
'ann_model_nadam_tuned'
]
# Custom thresholds for specific models (if any)
custom_thresholds = {
'lg_selected_poly_optimized': 0.35,
'lda_optimized': 0.2
}
# Call function for recall on class 1 comparison frame
recall_df = compile_recalls_class1(models, model_datasets, model_class, model_names, custom_thresholds)
recall_df
| | Model Class | Model Name | Recall_Train_Class_1 | Recall_Test_Class_1 |
|---|---|---|---|---|
| 0 | Logistic Regression | lg | 0.605769 | 0.588235 |
| 1 | Logistic Regression Feature Selected | lg_selected | 0.608173 | 0.593838 |
| 2 | Logistic Regression Feature Selected Poly | lg_selected_poly | 0.683894 | 0.644258 |
| 3 | Logistic Regression Feature Selected Poly Optimized | lg_selected_poly_optimized | 0.766827 | 0.725490 |
| 4 | Decision Tree | dt | 1.000000 | 0.624650 |
| 5 | Decision Tree Tuned | dt_tuned | 0.890625 | 0.815126 |
| 6 | Decision Tree Tuned Feature Selected | dt_tuned_selected | 0.890625 | 0.815126 |
| 7 | Random Forest | rf_estimator | 1.000000 | 0.677871 |
| 8 | Random Forest Weighted | rf_estimator_weighted | 1.000000 | 0.641457 |
| 9 | Random Forest Tuned | rf_estimator_tuned | 0.842548 | 0.789916 |
| 10 | K-Nearest Neighbor (KNN) | knn | 0.668269 | 0.577031 |
| 11 | KNN tuned | knn_tuned | 1.000000 | 0.806723 |
| 12 | Linear Discriminant Analysis (LDA) | lda | 0.670673 | 0.658263 |
| 13 | LDA Optimized | lda_optimized | 0.706731 | 0.703081 |
| 14 | Quadratic Discriminant Analysis (QDA) | qda | 0.647837 | 0.630252 |
| 15 | Support Vector Machine (SVM) | svm | 0.705529 | 0.635854 |
| 16 | SVM Tuned | svm_tuned | 0.885817 | 0.784314 |
| 17 | AdaBoost Classifier (ABC) | adaboost | 0.694712 | 0.591036 |
| 18 | ABC Tuned | adaboost_tuned | 0.782452 | 0.745098 |
| 19 | Gradient Boosting Classifier (GBC) | gbc | 0.737981 | 0.619048 |
| 20 | GBC Tuned | gbc_tuned | 0.777644 | 0.719888 |
| 21 | XGBoost Classifier (XGB) | xgb | 1.000000 | 0.714286 |
| 22 | XGB Tuned | xgb_tuned | 0.917067 | 0.815126 |
| 23 | Artificial Neural Network (ANN) | ann_model | 0.872596 | 0.806723 |
| 24 | ANN Tuned | ann_model_tuned | 0.955529 | 0.837535 |
| 25 | ANN with Nadam Optimizer | ann_model_nadam | 0.959135 | 0.829132 |
| 26 | ANN with Nadam Optimizer Tuned | ann_model_nadam_tuned | 0.955529 | 0.837535 |
Observations:
The tuned ANN variants achieve the highest test recall on class 1 (0.8375), followed by the tuned Decision Tree models and tuned XGBoost (0.8151 each). The untuned Decision Tree, both untuned Random Forests, untuned XGBoost, and tuned KNN all reach a perfect 1.0 recall on train but fall off noticeably on test, a sign of overfitting.
# Call function for macro averages comparison frame
macro_averages_df = compile_macro_averages(models, model_datasets, model_class, model_names, custom_thresholds)
macro_averages_df
| | Model Class | Model Name | Precision_Train | Precision_Test | Recall_Train | Recall_Test | Accuracy_Train | Accuracy_Test |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | lg | 0.808072 | 0.823536 | 0.772196 | 0.768960 | 0.872244 | 0.877517 |
| 1 | Logistic Regression Feature Selected | lg_selected | 0.806695 | 0.821037 | 0.772799 | 0.770713 | 0.871764 | 0.876957 |
| 2 | Logistic Regression Feature Selected Poly | lg_selected_poly | 0.840389 | 0.843289 | 0.814702 | 0.798369 | 0.893337 | 0.890940 |
| 3 | Logistic Regression Feature Selected Poly Optimized | lg_selected_poly_optimized | 0.822000 | 0.817395 | 0.843144 | 0.824660 | 0.889022 | 0.884228 |
| 4 | Decision Tree | dt | 1.000000 | 0.822869 | 1.000000 | 0.784372 | 1.000000 | 0.880313 |
| 5 | Decision Tree Tuned | dt_tuned | 0.783518 | 0.756601 | 0.870762 | 0.829296 | 0.858821 | 0.837808 |
| 6 | Decision Tree Tuned Feature Selected | dt_tuned_selected | 0.783518 | 0.756601 | 0.870762 | 0.829296 | 0.858821 | 0.837808 |
| 7 | Random Forest | rf_estimator | 1.000000 | 0.881806 | 1.000000 | 0.822863 | 1.000000 | 0.909955 |
| 8 | Random Forest Weighted | rf_estimator_weighted | 1.000000 | 0.877170 | 1.000000 | 0.805354 | 1.000000 | 0.903803 |
| 9 | Random Forest Tuned | rf_estimator_tuned | 0.810251 | 0.795658 | 0.867831 | 0.841149 | 0.883030 | 0.871924 |
| 10 | K-Nearest Neighbor (KNN) | knn | 0.898080 | 0.833196 | 0.822159 | 0.766503 | 0.914669 | 0.880313 |
| 11 | KNN tuned | knn_tuned | 1.000000 | 0.868092 | 1.000000 | 0.875758 | 1.000000 | 0.917226 |
| 12 | Linear Discriminant Analysis (LDA) | lda | 0.779200 | 0.782738 | 0.788630 | 0.785106 | 0.859540 | 0.861298 |
| 13 | LDA Optimized | lda_optimized | 0.776402 | 0.782187 | 0.801270 | 0.802274 | 0.858102 | 0.861857 |
| 14 | Quadratic Discriminant Analysis (QDA) | qda | 0.781268 | 0.791434 | 0.780505 | 0.777041 | 0.860259 | 0.865213 |
| 15 | Support Vector Machine (SVM) | svm | 0.862024 | 0.838560 | 0.830160 | 0.793469 | 0.905081 | 0.888143 |
| 16 | SVM Tuned | svm_tuned | 0.843002 | 0.815239 | 0.899196 | 0.847433 | 0.907239 | 0.885347 |
| 17 | AdaBoost Classifier (ABC) | adaboost | 0.884851 | 0.874536 | 0.831188 | 0.781891 | 0.913231 | 0.896532 |
| 18 | ABC Tuned | adaboost_tuned | 0.797639 | 0.792188 | 0.839280 | 0.822584 | 0.873442 | 0.869128 |
| 19 | Gradient Boosting Classifier (GBC) | gbc | 0.912971 | 0.880497 | 0.857463 | 0.795897 | 0.929291 | 0.902125 |
| 20 | GBC Tuned | gbc_tuned | 0.831174 | 0.820917 | 0.850948 | 0.823606 | 0.895014 | 0.885906 |
| 21 | XGBoost Classifier (XGB) | xgb | 1.000000 | 0.907154 | 1.000000 | 0.845263 | 1.000000 | 0.923937 |
| 22 | XGB Tuned | xgb_tuned | 0.839783 | 0.804668 | 0.910031 | 0.854803 | 0.905801 | 0.878635 |
| 23 | Artificial Neural Network (ANN) | ann_model | 0.811529 | 0.796421 | 0.879711 | 0.847806 | 0.883988 | 0.872483 |
| 24 | ANN Tuned | ann_model_tuned | 0.840745 | 0.803163 | 0.925220 | 0.862513 | 0.906999 | 0.877517 |
| 25 | ANN with Nadam Optimizer | ann_model_nadam | 0.856862 | 0.816890 | 0.934208 | 0.865649 | 0.919223 | 0.887584 |
| 26 | ANN with Nadam Optimizer Tuned | ann_model_nadam_tuned | 0.840745 | 0.803163 | 0.925220 | 0.862513 | 0.906999 | 0.877517 |
Observations:
XGBoost posts the best test accuracy (0.9239) and macro precision (0.9072), but its perfect train scores point to overfitting; tuned KNN behaves similarly. Among models with smaller train/test gaps, the Nadam ANN offers a strong balance, with a test macro recall of 0.8656 and test accuracy of 0.8876.
# Binary metrics for class 1 comparison frame
binary_metrics_df = compile_binary_metrics_class1(models, model_datasets, model_class, model_names, custom_thresholds)
binary_metrics_df
| | Model Class | Model Name | Train_Precision_Class_1 | Test_Precision_Class_1 | Train_Recall_Class_1 | Test_Recall_Class_1 | Train_F1_Class_1 | Test_F1_Class_1 |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | lg | 0.710860 | 0.744681 | 0.605769 | 0.588235 | 0.654121 | 0.657277 |
| 1 | Logistic Regression Feature Selected | lg_selected | 0.707692 | 0.738676 | 0.608173 | 0.593838 | 0.654169 | 0.658385 |
| 2 | Logistic Regression Feature Selected Poly | lg_selected_poly | 0.757656 | 0.771812 | 0.683894 | 0.644258 | 0.718888 | 0.702290 |
| 3 | Logistic Regression Feature Selected Poly Optimized | lg_selected_poly_optimized | 0.703418 | 0.703804 | 0.766827 | 0.725490 | 0.733755 | 0.714483 |
| 4 | Decision Tree | dt | 1.000000 | 0.735974 | 1.000000 | 0.624650 | 1.000000 | 0.675758 |
| 5 | Decision Tree Tuned | dt_tuned | 0.598063 | 0.565049 | 0.890625 | 0.815126 | 0.715596 | 0.667431 |
| 6 | Decision Tree Tuned Feature Selected | dt_tuned_selected | 0.598063 | 0.565049 | 0.890625 | 0.815126 | 0.715596 | 0.667431 |
| 7 | Random Forest | rf_estimator | 1.000000 | 0.840278 | 1.000000 | 0.677871 | 1.000000 | 0.750388 |
| 8 | Random Forest Weighted | rf_estimator_weighted | 1.000000 | 0.838828 | 1.000000 | 0.641457 | 1.000000 | 0.726984 |
| 9 | Random Forest Tuned | rf_estimator_tuned | 0.662571 | 0.646789 | 0.842548 | 0.789916 | 0.741799 | 0.711223 |
| 10 | K-Nearest Neighbor (KNN) | knn | 0.874214 | 0.765799 | 0.668269 | 0.577031 | 0.757493 | 0.658147 |
| 11 | KNN tuned | knn_tuned | 1.000000 | 0.784741 | 1.000000 | 0.806723 | 1.000000 | 0.795580 |
| 12 | Linear Discriminant Analysis (LDA) | lda | 0.641379 | 0.650970 | 0.670673 | 0.658263 | 0.655699 | 0.654596 |
| 13 | LDA Optimized | lda_optimized | 0.628205 | 0.640306 | 0.706731 | 0.703081 | 0.665158 | 0.670227 |
| 14 | Quadratic Discriminant Analysis (QDA) | qda | 0.650181 | 0.673653 | 0.647837 | 0.630252 | 0.649007 | 0.651230 |
| 15 | Support Vector Machine (SVM) | svm | 0.795393 | 0.764310 | 0.705529 | 0.635854 | 0.747771 | 0.694190 |
| 16 | SVM Tuned | svm_tuned | 0.716229 | 0.686275 | 0.885817 | 0.784314 | 0.792047 | 0.732026 |
| 17 | AdaBoost Classifier (ABC) | adaboost | 0.842566 | 0.844000 | 0.694712 | 0.591036 | 0.761528 | 0.695222 |
| 18 | ABC Tuned | adaboost_tuned | 0.652305 | 0.650367 | 0.782452 | 0.745098 | 0.711475 | 0.694517 |
| 19 | Gradient Boosting Classifier (GBC) | gbc | 0.888567 | 0.850000 | 0.737981 | 0.619048 | 0.806303 | 0.716370 |
| 20 | GBC Tuned | gbc_tuned | 0.718889 | 0.711911 | 0.777644 | 0.719888 | 0.747113 | 0.715877 |
| 21 | XGBoost Classifier (XGB) | xgb | 1.000000 | 0.882353 | 1.000000 | 0.714286 | 1.000000 | 0.789474 |
| 22 | XGB Tuned | xgb_tuned | 0.701932 | 0.658371 | 0.917067 | 0.815126 | 0.795206 | 0.728411 |
| 23 | Artificial Neural Network (ANN) | ann_model | 0.657609 | 0.644295 | 0.872596 | 0.806723 | 0.750000 | 0.716418 |
| 24 | ANN Tuned | ann_model_tuned | 0.693717 | 0.650000 | 0.955529 | 0.837535 | 0.803842 | 0.731946 |
| 25 | ANN with Nadam Optimizer | ann_model_nadam | 0.724796 | 0.678899 | 0.959135 | 0.829132 | 0.825660 | 0.746532 |
| 26 | ANN with Nadam Optimizer Tuned | ann_model_nadam_tuned | 0.693717 | 0.650000 | 0.955529 | 0.837535 | 0.803842 | 0.731946 |
Observations:
On the test set, tuned KNN (0.796) and XGBoost (0.789) lead on class-1 F1, though both carry perfect train scores; the Nadam ANN follows at 0.747. The tuned ANN models trade class-1 precision (0.65) for the highest class-1 test recall (0.8375).
# Saving the Decision Tree Tuned Feature Selected model
model_path = '/content/drive/My Drive/Final Models/dt_tuned_selected.joblib'
joblib.dump(dt_tuned_selected, model_path)
# Saving the ANN with Nadam Optimizer Tuned model
model_path = '/content/drive/My Drive/Final Models/ann_model_nadam_tuned'
ann_model_nadam_tuned.save(model_path)
# Saving the KNN Tuned model
model_path = '/content/drive/My Drive/Final Models/knn_tuned.joblib'
joblib.dump(knn_tuned, model_path)
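Before relying on the saved artifacts, it is prudent to reload each one and confirm its predictions match the in-memory model. A sketch of that round-trip check with joblib, using a stand-in decision tree rather than the actual dt_tuned_selected:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in for dt_tuned_selected; in practice the real fitted model is used
X = np.random.RandomState(0).rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Round-trip through joblib, as done for the tree and KNN models above
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dt_tuned_selected.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)
    assert (reloaded.predict(X) == model.predict(X)).all()
    print("round-trip OK")
```

For the Keras model, the analogous check reloads with `tf.keras.models.load_model(model_path)` and compares `predict` outputs.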
Conducting a preliminary cost-benefit analysis for the top model candidates - ANN, DT, XGBoost, and KNN - can provide an initial approximation of the costs and benefits associated with each model. This exploratory analysis aims to inform our solution design decision.
Development and Implementation Costs:
Maintenance Costs:
Performance and Accuracy:
Business Impact:
In conclusion, while ANNs offer high accuracy, their complexity and costs are significant. The Decision Tree Tuned Feature Selected, with its robust performance, lower costs, and regulatory compliance capabilities, emerges as a well-suited solution for our business problem.
1. Data Collection Costs
One-Time Expense:
2. Model Training and Maintenance Costs (For One Bank - 5,960 entries)
3. Deployment Costs and Model Improvement
4. Precision-Related Costs
Total Cost:
The model can correctly identify about 82 out of every 100 actual defaulters. If the bank acts on these identifications effectively, it is reasonable to assume a reduction in the default rate; an 82% recall will not translate into an 82% reduction in defaults, since not every identified default can be prevented. Let's conservatively assume that effective interventions cut defaults by 10% in relative terms, lowering the default rate from 20% to 18%.
1. Savings:
Savings for One Bank (5,960 entries):
Average loan amount: 18,607 dollars.
Existing Total Annual Loss from Defaults (20% default rate):
= 18,607 × 20% × 5,960 = 22,179,544 dollars
New Annual Loss from Defaults (18% default rate):
= 18,607 × 18% × 5,960 = 19,961,590 dollars
Total Potential Savings for One Bank:
= 22,179,544 − 19,961,590 = 2,217,954 dollars
Thus, for one bank with 5,960 loan entries, the potential annual savings would be approximately 2,217,954 dollars under this conservative estimate. If the bank operates multiple branches, these savings could multiply.
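The savings arithmetic can be reproduced directly (assuming the full HMEQ dataset of 5,960 loans, the 18,607-dollar average loan amount, and a two-point drop in the default rate):

```python
avg_loan = 18_607                   # average loan amount in dollars
n_loans = 5_960                     # loans in the HMEQ dataset
base_rate, new_rate = 0.20, 0.18    # default rate before / after intervention

existing_loss = avg_loan * base_rate * n_loans
new_loss = avg_loan * new_rate * n_loans
savings = existing_loss - new_loss

print(f"Existing annual loss: ${existing_loss:,.0f}")
print(f"New annual loss:      ${new_loss:,.0f}")
print(f"Potential savings:    ${savings:,.0f}")
```

Parameterizing the calculation this way makes it easy to rerun the estimate for other branches or for less conservative assumptions about the achievable reduction in defaults.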
2. Customer Trust and Regulatory Compliance:
Besides the direct financial benefits, the model can contribute to improving the bank's risk management processes, customer trust, and regulatory compliance.
Net Savings for One Bank:
1. Comparison of various techniques and their relative performance:
2. Cost-Benefit analysis and solution design:
3. Final solution design: